Abstract

We consider the problem of computing a lightest derivation of a global structure using
a set of weighted rules. A large variety of inference problems in AI can be formulated in
this framework. We generalize A* search and heuristics derived from abstractions to a
broad class of lightest derivation problems. We also describe a new algorithm that searches
for lightest derivations using a hierarchy of abstractions. Our generalization of A* gives a
new algorithm for searching AND/OR graphs in a bottom-up fashion.



We discuss how the algorithms described here provide a general architecture for addressing
the pipeline problem: the problem of passing information back and forth between
various stages of processing in a perceptual system. We consider examples in computer vision
and natural language processing. We apply the hierarchical search algorithm to the
problem of estimating the boundaries of convex objects in grayscale images and compare
it to other search methods. A second set of experiments demonstrates the use of a new
compositional model for finding salient curves in images.




1. Introduction 

We consider a class of problems defined by a set of weighted rules for composing structures 
into larger structures. The goal in such problems is to find a lightest (least cost) derivation 
of a global structure derivable with the given rules. A large variety of classical inference 
problems in AI can be expressed within this framework. For example the global structure 
might be a parse tree, a match of a deformable object model to an image, or an assignment 
of values to variables in a Markov random field. 

We define a lightest derivation problem in terms of a set of statements, a set of weighted 
rules for deriving statements using other statements and a special goal statement. In each 
case we are looking for the lightest derivation of the goal statement. We usually express a 
lightest derivation problem using rule "schemas" that implicitly represent a very large set 
of rules in terms of a small number of rules with variables. Lightest derivation problems 
are formally equivalent to search in AND/OR graphs (Nilsson, 1980), but we find that our 
formulation is more natural for the applications we are interested in. 

One of the goals of this research is the construction of algorithms for global optimization 
across many levels of processing in a perceptual system. As described below our algorithms 
can be used to integrate multiple stages of a processing pipeline into a single global
optimization problem that can be solved efficiently.

Dynamic programming is a fundamental technique for designing efficient inference
algorithms. Good examples are the Viterbi algorithm for hidden Markov models (Rabiner,
1989) and chart parsing methods for stochastic context free grammars (Charniak, 1996). 
The algorithms described here can be used to speed up the solution of problems normally 
solved using dynamic programming. We demonstrate this for a specific problem, where the 
goal is to estimate the boundary of a convex object in a cluttered image. In a second set 
of experiments we show how our algorithms can be used to find salient curves in images. 
We describe a new model for salient curves based on a compositional rule that enforces 
long range shape constraints. This leads to a problem that is too large to be solved using 
classical dynamic programming methods. 

The algorithms we consider are all related to Dijkstra's shortest paths algorithm (DSP) 
(Dijkstra, 1959) and A* search (Hart, Nilsson, & Raphael, 1968). Both DSP and A* can be 
used to find a shortest path in a cyclic graph. They use a priority queue to define an order 
in which nodes are expanded and have a worst case running time of O(M log N) where N
is the number of nodes in the graph and M is the number of edges. In DSP and A* the 
expansion of a node v involves generating all nodes u such that there is an edge from v to u. 
The only difference between the two methods is that A* uses a heuristic function to avoid 
expanding non-promising nodes. 

Knuth gave a generalization of DSP that can be used to solve a lightest derivation 
problem with cyclic rules (Knuth, 1977). We call this Knuth's lightest derivation algorithm 
(KLD). In analogy to Dijkstra's algorithm, KLD uses a priority queue to define an order in 
which statements are expanded. Here the expansion of a statement v involves generating 
all conclusions that can be derived in a single step using v and other statements already 
expanded. As long as each rule has a bounded number of antecedents KLD also has a worst 
case running time of O(M log N) where N is the number of statements in the problem
and M is the number of rules. Nilsson's AO* algorithm (1980) can also be used to solve
lightest derivation problems. Although AO* can use a heuristic function, it is not a true
generalization of A*: it does not use a priority queue, only handles acyclic rules, and can
require O(MN) time even when applied to a shortest path problem.^1 In particular, AO*
and its variants use a backward chaining technique that starts at the goal and repeatedly
refines subgoals, while A* is a forward chaining algorithm.^2

Klein and Manning (2003) described an A* parsing algorithm that is similar to KLD but 
can use a heuristic function. One of our contributions is a generalization of this algorithm 
to arbitrary lightest derivation problems. We call this algorithm A* lightest derivation 
(A*LD). The method is forward chaining, uses a priority queue to control the order in 
which statements are expanded, handles cyclic rules and has a worst case running time of 
O(M log N) for problems where each rule has a small number of antecedents. A*LD can be
seen as a true generalization of A* to lightest derivation problems. For a lightest derivation 
problem that comes from a shortest path problem A*LD is identical to A*. 

1. There are extensions that handle cyclic rules (Jimenez & Torras, 2000).

2. AO* is backward chaining in terms of the inference rules defining a lightest derivation problem.

Of course the running times seen in practice are often not well predicted by worst case
analysis. This is especially true for problems that are very large and defined implicitly. For
example, we can use dynamic programming to solve a shortest path problem in an acyclic
graph in O(M) time. This is better than the O(M log N) bound for DSP, but for implicit
graphs DSP can be much more efficient since it expands nodes in a best-first order. When
searching for a shortest path from a source to a goal, DSP will only expand nodes v with
d(v) ≤ w*. Here d(v) is the length of a shortest path from the source to v, and w* is the
length of a shortest path from the source to the goal. In the case of A* with a monotone and
admissible heuristic function, h(v), it is possible to obtain a similar bound when searching
implicit graphs. A* will only expand nodes v with d(v) + h(v) ≤ w*.
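The bound on expanded nodes can be checked on a small example. The following Python sketch is an illustration only: the graph, the heuristic values and the function name `best_first` are hypothetical and do not come from the paper. It runs the same best-first loop once with a zero heuristic (DSP) and once with a monotone heuristic (A*).

```python
import heapq

def best_first(graph, source, goal, h):
    """A* search; with h(v) = 0 for all v it reduces to Dijkstra's algorithm.

    Returns the expanded nodes mapped to their shortest distance from source.
    """
    expanded = {}
    queue = [(h(source), 0, source)]           # entries are (d + h, d, node)
    while queue:
        _, d, v = heapq.heappop(queue)
        if v in expanded:
            continue                           # v was already expanded earlier
        expanded[v] = d
        if v == goal:
            break
        for u, weight in graph.get(v, []):
            if u not in expanded:
                heapq.heappush(queue, (d + weight + h(u), d + weight, u))
    return expanded

# Hypothetical graph: node -> list of (successor, edge weight).
graph = {
    "s": [("a", 1), ("b", 4), ("e", 1)],
    "a": [("b", 1), ("t", 5)],
    "b": [("t", 1)],
    "e": [("t", 10)],
}
# Hypothetical monotone heuristic: h(v) <= weight(v, u) + h(u) on every edge.
h_values = {"s": 2, "a": 2, "b": 1, "e": 5, "t": 0}

dsp = best_first(graph, "s", "t", lambda v: 0)
astar = best_first(graph, "s", "t", h_values.get)
w_star = astar["t"]
```

In this example DSP expands node e because d(e) = 1 ≤ w*, while A* skips it because d(e) + h(e) = 6 exceeds w* = 3.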

The running time of KLD and A*LD can be expressed in a similar way. When solving a
lightest derivation problem, KLD will only expand statements v with d(v) ≤ w*. Here d(v)
is the weight of a lightest derivation for v, and w* is the weight of a lightest derivation of the
goal statement. Furthermore, A*LD will only expand statements v with d(v) + h(v) ≤ w*.
Here the heuristic function, h(v), gives an estimate of the additional weight necessary for
deriving the goal statement using a derivation of v. The heuristic values used by A*LD are
analogous to the distance from a node to the goal in a graph search problem (the notion
used by A*). We note that these heuristic values are significantly different from the ones
used by AO*. In the case of AO* the heuristic function, h(v), would estimate the weight
of a lightest derivation for v.

An important difference between A*LD and AO* is that A*LD computes derivations 
in a bottom-up fashion, while AO* uses a top-down approach. Each method has advan- 
tages, depending on the type of problem being solved. For example, a classical problem in 
computer vision involves grouping pixels into long and smooth curves. We can formulate 
the problem in terms of finding smooth curves between pairs of pixels that are far apart. 
For an image with n pixels there are Ω(n^2) such pairs. A straightforward implementation
of a top-down algorithm would start by considering these Ω(n^2) possibilities. A bottom-up
algorithm would start with O(n) pairs of nearby pixels. In this case we expect that a
bottom-up grouping method would be more efficient than a top-down method.

The classical AO* algorithm requires the set of rules to be acyclic. Jimenez and Torras 
(2000) extended the method to handle cyclic rules. Another top-down algorithm that can 
handle cyclic rules is described by Bonet and Geffner (2005). Hansen and Zilberstein (2001) 
described a search algorithm for problems where the optimal solutions themselves can be 
cyclic. The algorithms described in this paper can handle problems with cyclic rules but 
require that the optimal solutions be acyclic. We also note that AO* can handle rules with 
non-superior weight functions (as defined in Section 3) while KLD requires superior weight 
functions. A*LD replaces this requirement by a requirement on the heuristic function. 

A well known method for defining heuristics for A* is to consider an abstract or relaxed 
search problem. For example, consider the problem of solving a Rubik's cube in a small 
number of moves. Suppose we ignore the edge and center pieces and solve only the corners. 
This is an example of a problem abstraction. The number of moves necessary to put the 
corners in a good configuration is a lower bound on the number of moves necessary to solve 
the original problem. There are fewer corner configurations than there are full configurations
and that makes it easier to solve the abstract problem. In general, shortest paths to the 
goal in an abstract problem can be used to define an admissible and monotone heuristic 
function for solving the original problem with A*. 
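As a rough illustration of this construction for ordinary graph search, the Python sketch below maps every node to an abstract node and uses shortest abstract distances to the abstract goal as heuristic values. The function names, the grid-like graph and the abstraction that keeps only the first coordinate are all hypothetical choices made for this example.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(successor, weight)]."""
    dist = {}
    queue = [(0, source)]
    while queue:
        d, v = heapq.heappop(queue)
        if v in dist:
            continue
        dist[v] = d
        for u, weight in graph.get(v, []):
            if u not in dist:
                heapq.heappush(queue, (d + weight, u))
    return dist

def abstraction_heuristic(graph, goal, abs_of):
    """h(v) = shortest distance from abs_of(v) to abs_of(goal) in the abstract graph."""
    # Each concrete edge v -> u induces an abstract edge abs_of(v) -> abs_of(u).
    # The abstract graph is stored reversed so one search from the goal suffices.
    reversed_abstract = {}
    for v, edges in graph.items():
        for u, weight in edges:
            reversed_abstract.setdefault(abs_of(u), []).append((abs_of(v), weight))
    to_goal = dijkstra(reversed_abstract, abs_of(goal))
    return lambda v: to_goal.get(abs_of(v), float("inf"))

# Hypothetical concrete problem: nodes (x, y) on a 4 x 2 grid.
graph = {}
for x in range(4):
    for y in range(2):
        edges = [((x, 1 - y), 1)]              # switch rows at cost 1
        if x < 3:
            edges.append(((x + 1, y), 1 + y))  # move right; row 1 costs more
        graph[(x, y)] = edges

goal = (3, 0)
h = abstraction_heuristic(graph, goal, abs_of=lambda v: v[0])
```

Because every concrete path maps to an abstract path of the same weight, the resulting values never overestimate the true distance to the goal.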

Here we show that abstractions can also be used to define heuristic functions for A*LD. 
In a lightest derivation problem the notion of a shortest path to the goal is replaced by 
the notion of a lightest context, where a context for a statement v is a derivation of the
goal with a "hole" that can be filled in by a derivation of v. The computation of lightest
abstract contexts is itself a lightest derivation problem. 

Abstractions are related to problem relaxations defined by Pearl (1984). While abstrac- 
tions often lead to small problems that are solved through search, relaxations can lead to 
problems that still have a large state space but may be simple enough to be solved in closed 
form. The definition of abstractions that we use for lightest derivation problems includes
relaxations as a special case.

Another contribution of our work is a hierarchical search method that we call HA*LD. 
This algorithm can effectively use a hierarchy of abstractions to solve a lightest derivation 
problem. The algorithm is novel even in the case of classical search (shortest paths)
problems. HA*LD searches for lightest derivations and contexts at every level of abstraction
simultaneously. More specifically, each level of abstraction has its own set of statements
and rules. The search for lightest derivations and contexts at each level is controlled by a
single priority queue. To understand the running time of HA*LD, let w* be the weight of a
lightest derivation of the goal in the original (not abstracted) problem. For a statement v
in the abstraction hierarchy let d(v) be the weight of a lightest derivation for v at its level
of abstraction. Let h(v) be the weight of a lightest context for the abstraction of v (defined
at one level above v in the hierarchy). Let K be the total number of statements in the
hierarchy with d(v) + h(v) ≤ w*. HA*LD expands at most 2K statements before solving
the original problem. The factor of two comes from the fact that the algorithm computes 
both derivations and contexts at each level of abstraction. 

Previous algorithms that use abstractions for solving search problems include methods
based on pattern databases (Culberson & Schaeffer, 1998; Korf, 1997; Korf & Felner,
2002), Hierarchical A* (HA*, HIDA*) (Holte, Perez, Zimmer, & MacDonald, 1996; Holte,
Grajkowski, & Tanner, 2005) and coarse-to-fine dynamic programming (CFDP) (Raphael,
2001). Pattern databases have made it possible to compute solutions to impressively large 
search problems. These methods construct a lookup table of shortest paths from a node 
to the goal at all abstract states. In practice the approach is limited to tables that remain 
fixed over different problem instances, or relatively small tables if the heuristic must be 
recomputed for each instance. For example, for the Rubik's cube we can precompute the 
number of moves necessary to solve every corner configuration. This table can be used to 
define a heuristic function when solving any full configuration of the Rubik's cube. Both 
HA* and HIDA* use a hierarchy of abstractions and can avoid searching over all nodes at 
any level of the hierarchy. On the other hand, in directed graphs these methods may still 
expand abstract nodes with arbitrarily large heuristic values. It is also not clear how to 
generalize HA* and HIDA* to lightest derivation problems that have rules with more than 
one antecedent. Finally, CFDP is related to AO* in that it repeatedly solves ever more 
refined problems using dynamic programming. This leads to a worst case running time of 
O(NM). We will discuss the relationships between HA*LD and these other hierarchical
methods in more detail in Section 8. 

We note that both A* search and related algorithms have been previously used to solve 
a number of problems that are not classical state space search problems. This includes the 
traveling salesman problem (Zhang & Korf, 1996), planning (Edelkamp, 2002), multiple 
sequence alignment (Korf, Zhang, Thayer, & Hohwald, 2005), combinatorial problems on
graphs (Felner, 2005) and parsing using context-free grammars (Klein & Manning, 2003).
The work by Bulitko, Sturtevant, Lu, and Yau (2006) uses a hierarchy of state-space
abstractions for real-time search.

1.1 The Pipeline Problem 

A major problem in artificial intelligence is the integration of multiple processing stages to 
form a complete perceptual system. We call this the pipeline problem. In general we have 
a concatenation of systems where each stage feeds information to the next. In vision, for 
example, we might have an edge detector feeding information to a boundary finding system, 
which in turn feeds information to an object recognition system. 

Because of computational constraints and the need to build modules with clean interfaces,
pipelines often make hard decisions at module boundaries. For example, an edge detector
typically constructs a Boolean array that indicates whether or not an edge was detected
at each image location. But there is general recognition that the presence of an edge at a
certain location can depend on the context around it. People often see edges at places where
the image gradient is small if, at a higher cognitive level, it is clear that there is actually an
object boundary at that location. Speech recognition systems try to address this problem 
by returning n-best lists, but these may or may not contain the actual utterance. We would 
like the speech recognition system to be able to take high-level information into account 
and avoid the hard decision of exactly what strings to output in its n-best list. 

A processing pipeline can be specified by describing each of its stages in terms of rules for 
constructing structures using structures produced from a previous stage. In a vision system 
one stage could have rules for grouping edges into smooth curves while the next stage could 
have rules for grouping smooth curves into objects. In this case we can construct a single 
lightest derivation problem representing the entire system. Moreover, a hierarchical set of 
abstractions can be applied to the entire pipeline. By using HA*LD to compute lightest 
derivations a complete scene interpretation derived at one level of abstraction guides all 
processing stages at a more concrete level. This provides a mechanism that enables coarse 
high-level processing to guide low-level computation. We believe that this is an important 
property for implementing efficient perceptual pipelines that avoid making hard decisions 
between processing stages. 

We note that the formulation of a complete computer vision system as a lightest derivation
problem is related to the work by Geman, Potter, and Chi (2002), Tu, Chen, Yuille,
and Zhu (2005) and Jin and Geman (2006). In these papers image understanding is posed 
as a parsing problem, where the goal is to explain the image in terms of a set of objects that 
are formed by the (possibly recursive) composition of generic parts. Tu et al. (2005) use 
data driven MCMC to compute "optimal" parses while Geman et al. (2002) and Jin and 
Geman (2006) use a bottom-up algorithm for building compositions in a greedy fashion. 
Neither of these methods is guaranteed to compute an optimal scene interpretation. We
hope that HA*LD will provide a more principled computational technique for solving large 
parsing problems defined by compositional models. 

1.2 Overview 

We begin by formally defining lightest derivation problems in Section 2. That section also 
discusses dynamic programming and the relationship between lightest derivation problems
and AND/OR graphs. In Section 3 we describe Knuth's lightest derivation algorithm. In
Section 4 we describe A*LD and prove its correctness. Section 5 shows how abstractions 
can be used to define mechanically constructed heuristic functions for A*LD. We describe 
HA*LD in Section 6 and discuss its use in solving the pipeline problem in Section 7. Sec- 
tion 8 discusses the relationship between HA*LD and other hierarchical search methods. In 
Sections 9 and 10 we present some experimental results. We conclude in Section 11. 

2. Lightest Derivation Problems 

Let S be a set of statements and R be a set of inference rules of the following form,

$$
\frac{A_1 = w_1 \quad \cdots \quad A_n = w_n}{C = g(w_1, \ldots, w_n)}
$$

Here the antecedents $A_i$ and the conclusion C are statements in S, the weights $w_i$ are
non-negative real valued variables and g is a non-negative real valued weight function. For
a rule with no antecedents the function g is simply a non-negative real value. Throughout
the paper we also use $A_1, \ldots, A_n \rightarrow_g C$ to denote an inference rule of this type.

A derivation of C is a finite tree rooted at a rule $A_1, \ldots, A_n \rightarrow_g C$ with n children, where
the i-th child is a derivation of $A_i$. The leaves of this tree are rules with no antecedents.
Every derivation has a weight that is the value obtained by recursive application of the
functions g along the derivation tree. Figure 1 illustrates a derivation tree.

Intuitively a rule $A_1, \ldots, A_n \rightarrow_g C$ says that if we can derive the antecedents $A_i$ with
weights $w_i$ then we can derive the conclusion C with weight $g(w_1, \ldots, w_n)$. The problem
we are interested in is to compute a lightest derivation of a special goal statement. 

All of the algorithms discussed in this paper assume that the weight functions g as- 
sociated with a lightest derivation problem are non-decreasing in each variable. This is a 
fundamental property ensuring that lightest derivations have an optimal substructure prop- 
erty. In this case lightest derivations can be constructed from other lightest derivations. 

To facilitate the runtime analysis of algorithms we assume that every rule has a small 
number of antecedents. We use N to denote the number of statements in a lightest derivation 
problem, while M denotes the number of rules. For most of the problems we are interested 
in N and M are very large but the problem can be implicitly defined in a compact way,
by using a small number of rules with variables as in the examples below. We also assume
that N ≤ M since statements that are not in the conclusion of some rule are clearly not
derivable and can be ignored. 

2.1 Dynamic Programming 

Figure 1: A derivation of C is a tree of rules rooted at a rule r with conclusion C. The
children of the root are derivations of the antecedents in r. The leaves of the tree
are rules with no antecedents.

We say that a set of rules is acyclic if there is an ordering O of the statements in S such
that for any rule with conclusion C the antecedents are statements that come before C in
the ordering. Dynamic programming can be used to solve a lightest derivation problem if
the functions g in each rule are non-decreasing and the set of rules is acyclic. In this case
lightest derivations can be computed sequentially in terms of an acyclic ordering O. At the
i-th step a lightest derivation of the i-th statement is obtained by minimizing over all rules
that can be used to derive that statement. This method takes O(M) time to compute a
lightest derivation for each statement in S.
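The sequential computation can be sketched in a few lines of Python. This is only an illustration under assumed conventions: a rule is represented as a hypothetical triple (antecedents, conclusion, g), where g is a Python function of the antecedent weights, and the example rules are made up.

```python
def lightest_by_dp(order, rules):
    """Compute lightest derivation weights for acyclic rules.

    order: statements listed in an acyclic ordering.
    rules: list of (antecedents, conclusion, g) triples.
    """
    by_conclusion = {}
    for antecedents, conclusion, g in rules:
        by_conclusion.setdefault(conclusion, []).append((antecedents, g))

    weight = {}
    for statement in order:
        best = None
        # Minimize over all rules that can derive this statement.
        for antecedents, g in by_conclusion.get(statement, []):
            if all(a in weight for a in antecedents):
                w = g(*(weight[a] for a in antecedents))
                if best is None or w < best:
                    best = w
        if best is not None:               # statements with no derivation are skipped
            weight[statement] = best
    return weight

# Hypothetical rules; each g is non-decreasing in its arguments.
rules = [
    ((), "A", lambda: 1),
    ((), "B", lambda: 2),
    (("A", "B"), "C", lambda a, b: a + b + 1),
    (("A",), "C", lambda a: a + 5),
    (("C",), "D", lambda c: c + 1),
]
weights = lightest_by_dp(["A", "B", "C", "D"], rules)
```

Each rule is examined once, which matches the O(M) bound when every rule has a bounded number of antecedents.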

We note that for cyclic rules it is sometimes possible to compute lightest derivations by 
taking multiple passes over the statements. We also note that some authors would refer 
to Dijkstra's algorithm (and KLD) as a dynamic programming method. In this paper we 
only use the term when referring to algorithms that compute lightest derivations in a fixed 
order that is independent of the solutions computed along the way (this includes recursive 
implementations that use memoization).

2.2 Examples 

Rules for computing shortest paths from a single source in a weighted graph are shown
in Figure 2. We assume that we are given a weighted graph G = (V, E), where $w_{xy}$ is a
non-negative weight for each edge (x, y) ∈ E and s is a distinguished start node. The first
rule states that there is a path of weight zero to the start node s. The second set of rules
states that if there is a path to a node x we can extend that path with an edge from x to
y to obtain an appropriately weighted path to a node y. There is a rule of this type for
each edge in the graph. A lightest derivation of path(x) corresponds to a shortest path from
s to x. Note that for general graphs these rules can be cyclic. Figure 3 illustrates a graph
and two different derivations of path(b) using the rules just described. These correspond
to two different paths from s to b.

(1)

$$
\frac{}{\mathit{path}(s) = 0}
$$

(2) for each (x, y) ∈ E,

$$
\frac{\mathit{path}(x) = w}{\mathit{path}(y) = w + w_{xy}}
$$

Figure 2: Rules for computing shortest paths in a graph.

Figure 3: A graph with two highlighted paths from s to b and the corresponding derivations
using rules from Figure 2.

Rules for chart parsing are shown in Figure 4. We assume that we are given a weighted
context free grammar in Chomsky normal form (Charniak, 1996), i.e., a weighted set of
productions of the form $X \rightarrow s$ and $X \rightarrow Y Z$ where X, Y and Z are nonterminal symbols
and s is a terminal symbol. The input string is given by a sequence of terminals $(s_1, \ldots, s_n)$.

(1) for each production $X \rightarrow s_i$,

$$
\frac{}{\mathit{phrase}(X, i, i+1) = w(X \rightarrow s_i)}
$$

(2) for each production $X \rightarrow Y Z$ and $1 \le i < j < k \le n + 1$,

$$
\frac{\mathit{phrase}(Y, i, j) = w_1 \quad \mathit{phrase}(Z, j, k) = w_2}{\mathit{phrase}(X, i, k) = w_1 + w_2 + w(X \rightarrow Y Z)}
$$

Figure 4: Rules for parsing with a context free grammar.

The first set of rules states that if the grammar contains a production $X \rightarrow s_i$ then there is a
phrase of type X generating the i-th entry of the input with weight $w(X \rightarrow s_i)$. The second
set of rules states that if the grammar contains a production $X \rightarrow Y Z$ and there is a phrase
of type Y from i to j and a phrase of type Z from j to k then there is an appropriately
weighted phrase of type X from i to k. Let S be the start symbol of the grammar. The
goal of parsing is to find a lightest derivation of phrase(S, 1, n + 1). These rules are acyclic
because when phrases are composed together they form longer phrases.
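Since the parsing rules are acyclic, a lightest derivation can be computed by processing spans in order of increasing length. The Python sketch below is one way this might look; the function name, the dictionary encoding of the grammar and the toy grammar itself are assumptions made for illustration, not part of the paper.

```python
def lightest_parse(tokens, lexical, binary, start):
    """Weight of a lightest derivation of phrase(start, 1, n + 1), or None.

    lexical: {(X, terminal): weight} for productions X -> terminal.
    binary:  {(X, Y, Z): weight} for productions X -> Y Z.
    """
    n = len(tokens)
    weight = {}   # (X, i, k) -> weight of a lightest derivation of phrase(X, i, k)

    # Rule set (1): phrase(X, i, i + 1) = w(X -> s_i).
    for i, s in enumerate(tokens, start=1):
        for (X, terminal), w in lexical.items():
            if terminal == s:
                key = (X, i, i + 1)
                weight[key] = min(w, weight.get(key, float("inf")))

    # Rule set (2): spans by increasing length form an acyclic ordering.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            k = i + length
            for j in range(i + 1, k):
                for (X, Y, Z), w in binary.items():
                    if (Y, i, j) in weight and (Z, j, k) in weight:
                        total = weight[(Y, i, j)] + weight[(Z, j, k)] + w
                        if total < weight.get((X, i, k), float("inf")):
                            weight[(X, i, k)] = total
    return weight.get((start, 1, n + 1))

# Hypothetical weighted grammar in Chomsky normal form.
lexical = {("NP", "she"): 1.0, ("V", "eats"): 1.0, ("NP", "fish"): 2.0, ("V", "fish"): 3.0}
binary = {("S", "NP", "VP"): 0.5, ("VP", "V", "NP"): 0.5}
best = lightest_parse(["she", "eats", "fish"], lexical, binary, "S")
```

The loop over all productions at every split point is kept deliberately simple; an efficient implementation would index productions by their right-hand sides.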

2.3 AND/OR Graphs 

Lightest derivation problems are closely related to AND/OR graphs. Let S and R be a set
of statements and rules defining a lightest derivation problem. To convert the problem to 
an AND/OR graph representation we can build a graph with a disjunction node for each 
statement in S and a conjunction node for each rule in R. There is an edge from each state- 
ment to each rule deriving that statement, and an edge from each rule to its antecedents. 
The leaves of the AND/OR graph are rules with no antecedents. Now derivations of a 
statement using rules in R can be represented by solutions rooted at that statement in the 
corresponding AND/OR graph. Conversely, it is also possible to represent any AND/OR 
graph search problem as a lightest derivation problem. In this case we can view each node 
in the graph as a statement in S and build an appropriate set of rules R. 
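The conversion from rules to an AND/OR graph can be sketched as follows. The representation is an assumption made for this example: a rule is a hypothetical pair (antecedents, conclusion), OR nodes are statements, and AND nodes are identified by the index of the rule.

```python
def to_and_or_graph(rules):
    """Build an AND/OR graph from rules given as (antecedents, conclusion) pairs.

    Returns (or_edges, and_edges):
      or_edges:  statement (OR node) -> indices of the rules deriving it
      and_edges: rule index (AND node) -> antecedent statements of that rule
    """
    or_edges = {}
    and_edges = {}
    for index, (antecedents, conclusion) in enumerate(rules):
        or_edges.setdefault(conclusion, []).append(index)
        and_edges[index] = list(antecedents)
        for a in antecedents:
            or_edges.setdefault(a, [])     # make sure every statement has a node
    return or_edges, and_edges

# Hypothetical instance of the shortest path rules for edges s->a, a->b, s->b.
rules = [
    ((), "path(s)"),
    (("path(s)",), "path(a)"),
    (("path(a)",), "path(b)"),
    (("path(s)",), "path(b)"),
]
or_edges, and_edges = to_and_or_graph(rules)
leaves = [index for index, antecedents in and_edges.items() if not antecedents]
```

Here the only leaf is the rule with no antecedents, and path(b) is a disjunction node with two outgoing edges, one per rule that derives it.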

3. Knuth's Lightest Derivation 

Knuth (1977) described a generalization of Dijkstra's shortest paths algorithm that we call 
Knuth's lightest derivation (KLD). Knuth's algorithm can be used to solve a large class of 
lightest derivation problems. The algorithm allows the rules to be cyclic but requires that 
the weight functions associated with each rule be non-decreasing and superior. Specifically 
we require the following two properties on the weight function g in each rule, 

non-decreasing: if $w_i' \ge w_i$ then $g(w_1, \ldots, w_i', \ldots, w_n) \ge g(w_1, \ldots, w_i, \ldots, w_n)$
superior: $g(w_1, \ldots, w_n) \ge w_i$

For example,

$$
g(x_1, \ldots, x_n) = x_1 + \cdots + x_n
$$

$$
g(x_1, \ldots, x_n) = \max(x_1, \ldots, x_n)
$$

are both non-decreasing and superior functions. 

Knuth's algorithm computes lightest derivations in non-decreasing weight order. Since 
we are only interested in a lightest derivation of a special goal statement we can often stop 
the algorithm before computing the lightest derivation of every statement. 

A weight assignment is an expression of the form (B = w) where B is a statement in S
and w is a non-negative real value. We say that the weight assignment (B = w) is derivable
if there is a derivation of B with weight w. For any set of rules R, statement B, and weight
w we write $R \vdash (B = w)$ if the rules in R can be used to derive (B = w). Let $\ell(B, R)$ be
the infimum of the set of weights derivable for B,

$$
\ell(B, R) = \inf \{ w : R \vdash (B = w) \}.
$$

Given a set of rules R and a statement $\mathit{goal} \in S$ we are interested in computing a derivation
of goal with weight $\ell(\mathit{goal}, R)$.

We define a bottom-up logic programming language in which we can easily express the 
algorithms we wish to discuss throughout the rest of the paper. Each algorithm is defined 
by a set of rules with priorities. We encode the priority of a rule by writing it along the 
line separating the antecedents and the conclusion as follows, 

$$
\frac{A_1 = w_1 \quad \cdots \quad A_n = w_n}{C = g(w_1, \ldots, w_n)} \; p(w_1, \ldots, w_n)
$$

We call a rule of this form a prioritized rule. The execution of a set of prioritized rules 
P is defined by the procedure in Figure 5. The procedure keeps track of a set S and a
priority queue Q of weight assignments of the form (B = w). Initially S is empty and Q
contains weight assignments defined by rules with no antecedents at the priorities given by
those rules. We iteratively remove the lowest priority assignment (B = w) from Q. If B
already has an assigned weight in S then the new assignment is ignored. Otherwise we add
the new assignment to S and "expand it": every assignment derivable from (B = w) and
other assignments already in S using some rule in P is added to Q at the priority specified
by the rule. The procedure stops when the queue is empty.

The result of executing a set of prioritized rules is a set of weight assignments. Moreover, 
the procedure can implicitly keep track of derivations by remembering which assignments 
were used to derive an item that is inserted in the queue. 

Lemma 1. The execution of a finite set of prioritized rules P derives every statement that
is derivable with rules in P.

Proof. Each rule causes at most one item to be inserted in the queue. Thus eventually Q
is empty and the algorithm terminates. When Q is empty every statement derivable by a
single rule using antecedents with weight in S already has a weight in S. This implies that
every derivable statement has a weight in S. □

Procedure Run(P)

1. S ← ∅
2. Initialize Q with assignments defined by rules with no antecedents at their priorities
3. while Q is not empty
4.     Remove the lowest priority element (B = w) from Q
5.     if B has no assigned weight in S
6.         S ← S ∪ {(B = w)}
7.         Insert assignments derivable from (B = w) and other assignments in S using
           some rule in P into Q at the priority specified by the rule
8. return S

Figure 5: Running a set of prioritized rules.

Now we are ready to define Knuth's lightest derivation algorithm. The algorithm is 
easily described in terms of prioritized rules. 

Definition 1 (Knuth's lightest derivation). Let R be a finite set of non-decreasing and
superior rules. Define a set of prioritized rules $\mathcal{K}(R)$ by setting the priority of each rule in
R to be the weight of the conclusion. KLD is given by the execution of $\mathcal{K}(R)$.

We can show that while running $\mathcal{K}(R)$, if (B = w) is added to S then $w = \ell(B, R)$.
This means that all assignments in S represent lightest derivations. We can also show that
assignments are inserted into S in non-decreasing weight order. If we stop the algorithm as
soon as we insert a weight assignment for goal into S we will expand all statements B such
that $\ell(B, R) < \ell(\mathit{goal}, R)$ and some statements B such that $\ell(B, R) = \ell(\mathit{goal}, R)$. These
properties follow from a more general result described in the next section. 
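To make the definition concrete, here is a minimal Python sketch of the procedure in Figure 5 together with the KLD priorities. It assumes an explicitly listed rule set, a hypothetical triple encoding (antecedents, conclusion, g), and a simple linear index from antecedents to rules; it is an illustration rather than the efficient implementation discussed in Section 3.1.

```python
import heapq

def run(rules, priority):
    """Execute prioritized rules given as (antecedents, conclusion, g) triples.

    priority(rule, weights) returns the queue priority of the assignment
    derived by the rule from the given antecedent weights.
    """
    assigned = {}        # the set S, stored as statement -> weight
    queue = []           # the priority queue Q
    counter = 0          # tie-breaker so the heap never compares statements
    by_antecedent = {}
    for rule in rules:
        antecedents, conclusion, g = rule
        if not antecedents:
            heapq.heappush(queue, (priority(rule, ()), counter, conclusion, g()))
            counter += 1
        for a in set(antecedents):
            by_antecedent.setdefault(a, []).append(rule)

    while queue:
        _, _, b, w = heapq.heappop(queue)
        if b in assigned:
            continue                       # B already has a weight in S
        assigned[b] = w
        # Expand (B = w): fire rules whose antecedents now all have weights.
        for rule in by_antecedent.get(b, []):
            antecedents, conclusion, g = rule
            if all(a in assigned for a in antecedents):
                ws = tuple(assigned[a] for a in antecedents)
                heapq.heappush(queue, (priority(rule, ws), counter, conclusion, g(*ws)))
                counter += 1
    return assigned

def kld(rules):
    # KLD: the priority of a rule is the weight of its conclusion.
    return run(rules, lambda rule, ws: rule[2](*ws))

# Hypothetical cyclic graph encoded with the shortest path rules of Figure 2.
edges = {("s", "a"): 1, ("s", "b"): 4, ("a", "b"): 1, ("b", "a"): 1, ("b", "t"): 2}
rules = [((), "path(s)", lambda: 0)]
rules += [((f"path({x})",), f"path({y})", lambda w, c=c: w + c)
          for (x, y), c in edges.items()]
# A made-up rule with two antecedents and a superior weight function.
rules.append((("path(a)", "path(b)"), "meet", max))
result = kld(rules)
```

Python dictionaries preserve insertion order, so iterating over `result` shows the order in which assignments entered S.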

3.1 Implementation 

The algorithm in Figure 5 can be implemented to run in O(M log N) time, where N and
M refer to the size of the problem defined by the prioritized rules P. 

In practice the set of prioritized rules P is often specified implicitly, in terms of a small 
number of rules with variables. In this case the problem of executing P is closely related to 
the work on logical algorithms described by McAllester (2002).

The main difficulty in devising an efficient implementation of the procedure in Figure 5 
is in step 7. In that step we need to find weight assignments in S that can be combined 
with (B = w) to derive new weight assignments. The logical algorithms work shows how a
set of inference rules with variables can be transformed into a new set of rules, such that 
every rule has at most two antecedents and is in a particularly simple form. Moreover, 
this transformation does not increase the number of rules too much. Once the rules are 
transformed their execution can be implemented efficiently using a hashtable to represent 
<S, a heap to represent Q and indexing tables that allow us to perform step 7 quickly. 
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Consider the second set of rules for parsing in Figure 4. These can be represented by 
a single rule with variables. Moreover the rule has two antecedents. When executing the 
parsing rules we keep track of a table mapping a value for j to statements phrase{Y,i, j) 
that have a weight in <S. Using this table we can quickly find statements that have a weight 
in S and can be combined with a statement of the form phrase {Z, j, k) . Similarly we keep 
track of a table mapping a value for j to statements phrase{Z, j,k) that have a weight in 
«S. The second table lets us quickly find statements that can be combined with a statement 
of the form phrase{Y,i,j). We refer the reader to (McAllester, 2002) for more details. 
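The two indexing tables can be illustrated with a minimal Python sketch. It assumes binary parsing rules of the form $\mathit{phrase}(Y, i, j), \mathit{phrase}(Z, j, k) \rightarrow \mathit{phrase}(X, i, k)$ with an additive rule weight, and it queues each statement at its weight (the KLD priority). The function name `lightest_parse`, the `lexicon` and `binary_rules` dictionaries, and the toy grammar are hypothetical and serve only as an illustration.

```python
import heapq
from collections import defaultdict

def lightest_parse(words, lexicon, binary_rules, goal_symbol):
    """lexicon: (symbol, word) -> weight; binary_rules: (Y, Z) -> [(X, weight)]."""
    n = len(words)
    S = {}                          # hashtable: statement (X, i, j) -> weight
    Q = []                          # heap of (priority, statement)
    ends_at = defaultdict(list)     # j -> [(Y, i)] with phrase(Y, i, j) in S
    starts_at = defaultdict(list)   # j -> [(Z, k)] with phrase(Z, j, k) in S
    for i, word in enumerate(words):
        for (sym, lex_word), weight in lexicon.items():
            if lex_word == word:
                heapq.heappush(Q, (weight, (sym, i, i + 1)))
    while Q:
        w, stmt = heapq.heappop(Q)
        if stmt in S:
            continue                # a lighter derivation was already found
        S[stmt] = w
        sym, i, j = stmt
        if stmt == (goal_symbol, 0, n):
            return w, S
        ends_at[j].append((sym, i))
        starts_at[i].append((sym, j))
        # stmt plays the role of phrase(Y, i, j): look up phrase(Z, j, k)
        for Z, k in starts_at[j]:
            for X, v in binary_rules.get((sym, Z), ()):
                heapq.heappush(Q, (w + S[(Z, j, k)] + v, (X, i, k)))
        # stmt plays the role of phrase(Z, i, j): look up phrase(Y, h, i)
        for Y, h in ends_at[i]:
            for X, v in binary_rules.get((Y, sym), ()):
                heapq.heappush(Q, (S[(Y, h, i)] + w + v, (X, h, j)))
    return None, S

# Toy grammar (hypothetical weights).
lexicon = {("Det", "the"): 1.0, ("N", "dog"): 1.0, ("VP", "barks"): 2.0}
binary_rules = {("Det", "N"): [("NP", 0.5)], ("NP", "VP"): [("S", 1.0)]}
weight, chart = lightest_parse(["the", "dog", "barks"], lexicon, binary_rules, "S")
```

Each pair of adjacent phrases is combined exactly once, at the moment the second of the two enters $\mathcal{S}$, because a statement is added to the tables only when it is expanded.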

4. A* Lightest Derivation 

Our A* lightest derivation algorithm (A*LD) is a generalization of A* search to lightest 
derivation problems that subsumes A* parsing. The algorithm is similar to KLD but it can 
use a heuristic function to speed up computation. Consider a lightest derivation problem 
with rules $R$ and goal statement goal. Knuth's algorithm will expand any statement $B$
such that $\ell(B, R) < \ell(\mathit{goal}, R)$. By using a heuristic function A*LD can avoid expanding
statements that have light derivations but are not part of a light derivation of goal.

Let $R$ be a set of rules with statements in $\Sigma$, and $h$ be a heuristic function assigning
a weight to each statement. Here $h(B)$ is an estimate of the additional weight required to
derive goal using a derivation of $B$. We note that in the case of a shortest path problem this
weight is exactly the distance from a node to the goal. The value $\ell(B, R) + h(B)$ provides
a figure of merit for each statement $B$. The A* lightest derivation algorithm expands
statements in order of their figure of merit.

We say that a heuristic function is monotone if for every rule $A_1, \ldots, A_n \rightarrow_g C$ in $R$
and derivable weight assignments $(A_i = w_i)$ we have,

$$w_i + h(A_i) \le g(w_1, \ldots, w_n) + h(C). \qquad (1)$$

This definition agrees with the standard notion of a monotone heuristic function for rules
that come from a shortest path problem. We can show that if $h$ is monotone and $h(\mathit{goal}) = 0$
then $h$ is admissible under an appropriate notion of admissibility. For the correctness of
A*LD, however, it is only required that $h$ be monotone and that $h(\mathit{goal})$ be finite. In this
case monotonicity implies that the heuristic value of every statement $C$ that appears in a
derivation of goal is finite. Below we assume that $h(C)$ is finite for every statement. If $h(C)$
is not finite we can ignore $C$ and every rule that derives $C$.
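As an illustration, condition (1) can be checked mechanically on a small ground problem. The sketch below assumes additive rules (conclusion weight equal to a rule weight $v$ plus the sum of the antecedent weights) with distinct antecedents. Under that assumption (1) reads $h(A_i) \le v + \sum_{j \ne i} w_j + h(C)$, which is tightest when the other antecedents take their lightest weights, so it is enough to test it there. The function name `is_monotone`, the data layout and the numbers are hypothetical.

```python
def is_monotone(rules, lightest, h, tol=1e-12):
    """Check condition (1) for additive ground rules at lightest weights.

    rules:    list of (antecedents, v, conclusion)
    lightest: dict statement -> lightest derivable weight
    h:        dict statement -> heuristic value
    """
    for antecedents, v, conclusion in rules:
        conclusion_weight = v + sum(lightest[a] for a in antecedents)
        for a in antecedents:
            if lightest[a] + h[a] > conclusion_weight + h[conclusion] + tol:
                return False
    return True

# Hypothetical problem: A and B are axioms of weight 1, and A, B derive goal.
rules = [((), 1, "A"), ((), 1, "B"), (("A", "B"), 1, "goal")]
lightest = {"A": 1, "B": 1, "goal": 3}

good = is_monotone(rules, lightest, {"A": 2, "B": 2, "goal": 0})
bad = is_monotone(rules, lightest, {"A": 5, "B": 2, "goal": 0})
```

Here `good` holds because $1 + 2 \le 3 + 0$ for both antecedents, while `bad` fails because $1 + 5 > 3 + 0$.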

Definition 2 (A* lightest derivation). Let $R$ be a finite set of non-decreasing rules and $h$
be a monotone heuristic function for $R$. Define a set of prioritized rules $\mathcal{A}(R)$ by setting
the priority of each rule in $R$ to be the weight of the conclusion plus the heuristic value,
$g(w_1, \ldots, w_n) + h(C)$. A*LD is given by the execution of $\mathcal{A}(R)$.

Now we show that the execution of $\mathcal{A}(R)$ correctly computes lightest derivations and
that it expands statements in order of their figure of merit values.

Theorem 2. During the execution of $\mathcal{A}(R)$, if $(B = w) \in \mathcal{S}$ then $w = \ell(B, R)$.

Proof. The proof is by induction on the size of $\mathcal{S}$. The statement is trivial when $\mathcal{S} = \emptyset$.
Suppose the statement was true right before the algorithm removed $(B = w_b)$ from $\mathcal{Q}$ and
added it to $\mathcal{S}$. The fact that $(B = w_b) \in \mathcal{Q}$ implies that the weight assignment is derivable
and thus $w_b \ge \ell(B, R)$.

Suppose $T$ is a derivation of $B$ with weight $w'_b < w_b$. Consider the moment right before
the algorithm removed $(B = w_b)$ from $\mathcal{Q}$ and added it to $\mathcal{S}$. Let $A_1, \ldots, A_n \rightarrow_g C$ be a
rule in $T$ such that the antecedents $A_i$ have a weight in $\mathcal{S}$ while the conclusion $C$ does not.
Let $w_c = g(\ell(A_1, R), \ldots, \ell(A_n, R))$. By the induction hypothesis the weight of $A_i$ in $\mathcal{S}$ is
$\ell(A_i, R)$. Thus $(C = w_c) \in \mathcal{Q}$ at priority $w_c + h(C)$. Let $w'_c$ be the weight that $T$ assigns to
$C$. Since $g$ is non-decreasing we know $w_c \le w'_c$. Since $h$ is monotone $w'_c + h(C) \le w'_b + h(B)$.
This follows by using the monotonicity condition along the path from $C$ to $B$ in $T$. Now
note that $w_c + h(C) < w_b + h(B)$ which in turn implies that $(B = w_b)$ is not the weight
assignment in $\mathcal{Q}$ with minimum priority. □

Theorem 3. During the execution of $\mathcal{A}(R)$ statements are expanded in order of the figure
of merit value $\ell(B, R) + h(B)$.

Proof. First we show that the minimum priority of $\mathcal{Q}$ does not decrease throughout the
execution of the algorithm. Suppose $(B = w)$ is an element in $\mathcal{Q}$ with minimum priority.
Removing $(B = w)$ from $\mathcal{Q}$ does not decrease the minimum priority. Now suppose we add
$(B = w)$ to $\mathcal{S}$ and insert assignments derivable from $(B = w)$ into $\mathcal{Q}$. Since $h$ is monotone
the priority of every assignment derivable from $(B = w)$ is at least the priority of $(B = w)$.

A weight assignment $(B = w)$ is expanded when it is removed from $\mathcal{Q}$ and added to $\mathcal{S}$.
By the last theorem $w = \ell(B, R)$ and by the definition of $\mathcal{A}(R)$ this weight assignment was
queued at priority $\ell(B, R) + h(B)$. Since we removed $(B = w)$ from $\mathcal{Q}$ this must be the
minimum priority in the queue. The minimum priority does not decrease over time so we
must expand statements in order of their figure of merit value. □

If we have accurate heuristic functions A*LD can be much more efficient than KLD.
Consider a situation where we have a perfect heuristic function. That is, suppose $h(B)$
is exactly the additional weight required to derive goal using a derivation of $B$. Now the
figure of merit $\ell(B, R) + h(B)$ equals the weight of a lightest derivation of goal that uses
$B$. In this case A*LD will derive goal before expanding any statements that are not part
of a lightest derivation of goal.

The correctness of KLD follows from the correctness of A*LD. For a set of non-decreasing
and superior rules we can consider the trivial heuristic function $h(B) = 0$. The fact that
the rules are superior implies that this heuristic is monotone. The theorems above imply
that Knuth's algorithm correctly computes lightest derivations and expands statements in
order of their lightest derivable weights.
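The execution of $\mathcal{A}(R)$ can be sketched for a finite set of ground rules as follows. This is a minimal illustration rather than the implementation discussed in Section 3.1: the function names, the rule encoding `(antecedents, g, conclusion)` and the example problem are assumptions. In the example, statements `D` and `E` have light derivations but do not take part in any derivation of `goal`; the heuristic value 10 assigned to them is an arbitrary finite number that keeps the heuristic monotone.

```python
import heapq
from collections import defaultdict

def additive(v):
    """Weight function of an additive rule with rule weight v."""
    return lambda *ws: v + sum(ws)

def run_a_star_ld(rules, goal, h):
    """Execute A(R): rules are (antecedents, g, conclusion), h is monotone."""
    uses = defaultdict(list)            # statement -> indices of rules using it
    for idx, (ants, _, _) in enumerate(rules):
        for a in set(ants):
            uses[a].append(idx)
    S, Q, order = {}, [], []

    def fire(idx):
        ants, g, concl = rules[idx]
        w = g(*(S[a] for a in ants))
        heapq.heappush(Q, (w + h(concl), w, concl))   # priority g(...) + h(C)

    for idx, (ants, _, _) in enumerate(rules):
        if not ants:
            fire(idx)
    while Q:
        _, w, stmt = heapq.heappop(Q)
        if stmt in S:
            continue
        S[stmt] = w
        order.append(stmt)
        if stmt == goal:
            break
        for idx in uses[stmt]:
            if all(a in S for a in rules[idx][0]):
                fire(idx)
    return S, order

rules = [
    ((), additive(1), "A"),
    ((), additive(1), "B"),
    ((), additive(1), "D"),
    (("A", "B"), additive(1), "goal"),
    (("D",), additive(1), "E"),
]
h_values = {"A": 2, "B": 2, "goal": 0, "D": 10, "E": 10}

kld_weights, kld_order = run_a_star_ld(rules, "goal", lambda s: 0)
astar_weights, astar_order = run_a_star_ld(rules, "goal", h_values.get)
```

With the trivial heuristic the run behaves like KLD and expands `D` and `E` before reaching `goal`. With the informative heuristic only `A`, `B` and `goal` are expanded, and both runs assign weight 3 to `goal`.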

5. Heuristics Derived from Abstractions 

Here we consider the case of additive rules: rules where the weight of the conclusion is
the sum of the weights of the antecedents plus a non-negative value $v$ called the weight of
the rule. We denote such a rule by $A_1, \ldots, A_n \rightarrow_v C$. The weight of a derivation using
additive rules is the sum of the weights of the rules that appear in the derivation tree.

A context for a statement $B$ is a finite tree of rules such that if we add a derivation of
$B$ to the tree we get a derivation of goal. Intuitively a context for $B$ is a derivation of goal
with a "hole" that can be filled in by a derivation of $B$ (see Figure 6).






Felzenszwalb & McAllester 




Figure 6: A derivation of goal defines contexts for the statements that appear in the derivation
tree. Note how a context for $C$ together with a rule $A_1, A_2, A_3 \rightarrow C$ and
derivations of $A_1$ and $A_2$ define a context for $A_3$.



For additive rules, each context has a weight that is the sum of weights of the rules in it.
Let $R$ be a set of additive rules with statements in $\Sigma$. For $B \in \Sigma$ we define $\ell(\mathit{context}(B), R)$
to be the weight of a lightest context for $B$. The value $\ell(B, R) + \ell(\mathit{context}(B), R)$ is the
weight of a lightest derivation of goal that uses $B$.

Contexts can be derived using rules in $R$ together with context rules $c(R)$ defined as
follows. First, goal has an empty context with weight zero. This is captured by a rule with
no antecedents $\rightarrow_0 \mathit{context}(\mathit{goal})$. For each rule $A_1, \ldots, A_n \rightarrow_v C$ in $R$ we put $n$ rules in
$c(R)$. These rules capture the notion that a context for $C$ and derivations of $A_j$ for $j \ne i$
define a context for $A_i$,

$$\mathit{context}(C), A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_n \rightarrow_v \mathit{context}(A_i).$$

Figure 6 illustrates how a context for $C$ together with derivations of $A_1$ and $A_2$ and a rule
$A_1, A_2, A_3 \rightarrow C$ define a context for $A_3$.

We say that a heuristic function $h$ is admissible if $h(B) \le \ell(\mathit{context}(B), R)$. Admissible
heuristic functions never over-estimate the weight of deriving goal using a derivation of a
particular statement. The heuristic function is perfect if $h(B) = \ell(\mathit{context}(B), R)$. Now we
show how to obtain admissible and monotone heuristic functions from abstractions.
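The construction of the context rules $c(R)$ described above can be written down directly. The sketch below is illustrative only: it assumes ground additive rules encoded as `(antecedents, v, conclusion)` and represents the statement $\mathit{context}(B)$ as the pair `("context", B)`. The function name and the rule weights in the example are hypothetical.

```python
def context_rules(rules, goal):
    """Build c(R) from additive rules given as (antecedents, v, conclusion)."""
    # The goal has an empty context of weight zero.
    c_rules = [((), 0, ("context", goal))]
    for antecedents, v, conclusion in rules:
        for i, a_i in enumerate(antecedents):
            # A context for the conclusion plus derivations of the other
            # antecedents define a context for the i-th antecedent.
            others = tuple(antecedents[:i]) + tuple(antecedents[i + 1:])
            c_rules.append(
                ((("context", conclusion),) + others, v, ("context", a_i))
            )
    return c_rules

# A rule with three antecedents, as in Figure 6 (the weights are made up).
rules = [(("A1", "A2", "A3"), 4, "C"), (("C",), 1, "goal")]
c_rules = context_rules(rules, "goal")
```

A rule with $n$ antecedents contributes $n$ context rules, so the example yields five rules in total: one for the goal, three from the first rule and one from the second.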

5.1 Abstractions 

Let $(\Sigma, R)$ be a lightest derivation problem with statements $\Sigma$ and rules $R$. An abstraction
of $(\Sigma, R)$ is given by a problem $(\Sigma', R')$ and a map $\mathit{abs} : \Sigma \rightarrow \Sigma'$, such that for every rule
$A_1, \ldots, A_n \rightarrow_v C$ in $R$ there is a rule $\mathit{abs}(A_1), \ldots, \mathit{abs}(A_n) \rightarrow_{v'} \mathit{abs}(C)$ in $R'$ with $v' \le v$.
Below we show how an abstraction can be used to define a monotone and admissible heuristic
function for the original problem.

We usually think of $\mathit{abs}$ as defining a coarsening of $\Sigma$ by mapping several statements
into the same abstract statement. For example, for a parser $\mathit{abs}$ might map a lexicalized
nonterminal $NP_{\mathit{house}}$ to the nonlexicalized nonterminal $NP$. In this case the abstraction
defines a smaller problem on the abstract statements. Abstractions can often be defined in
a mechanical way by starting with a map $\mathit{abs}$ from $\Sigma$ into some set of abstract statements
$\Sigma'$. We can then "project" the rules in $R$ from $\Sigma$ into $\Sigma'$ using $\mathit{abs}$ to get a set of abstract
rules. Typically several rules in $R$ will map to the same abstract rule. We only need to
keep one copy of each abstract rule, with a weight that is a lower bound on the weight of
the concrete rules mapping into it.
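The projection step can be sketched as follows, assuming ground additive rules encoded as `(antecedents, v, conclusion)` and an abstraction given as a dictionary. Taking the minimum weight over the concrete rules that map to the same abstract rule is one valid choice of lower bound (the tightest one); the function name and the toy statements are hypothetical.

```python
def project_rules(rules, abs_map):
    """Project additive rules through abs, keeping one copy per abstract rule."""
    abstract = {}
    for antecedents, v, conclusion in rules:
        key = (tuple(abs_map[a] for a in antecedents), abs_map[conclusion])
        # The smallest concrete weight is a lower bound for every rule
        # that maps to this abstract rule, so v' <= v holds.
        abstract[key] = min(v, abstract.get(key, v))
    return [(ants, v, concl) for (ants, concl), v in abstract.items()]

# Two lexicalized rules that collapse into one nonlexicalized rule.
rules = [
    (("Det", "N_house"), 3, "NP_house"),
    (("Det", "N_dog"), 2, "NP_dog"),
]
abs_map = {
    "Det": "Det", "N_house": "N", "N_dog": "N",
    "NP_house": "NP", "NP_dog": "NP",
}
abstract_rules = project_rules(rules, abs_map)
```

Both concrete rules map to the single abstract rule deriving `NP` from `Det` and `N`, which is kept with weight 2.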

Every derivation in $(\Sigma, R)$ maps to an abstract derivation so we have $\ell(\mathit{abs}(C), R') \le
\ell(C, R)$. If we let the goal of the abstract problem be $\mathit{abs}(\mathit{goal})$ then every context in $(\Sigma, R)$
maps to an abstract context and we see that $\ell(\mathit{context}(\mathit{abs}(C)), R') \le \ell(\mathit{context}(C), R)$.
This means that lightest abstract context weights form an admissible heuristic function,

$$h(C) = \ell(\mathit{context}(\mathit{abs}(C)), R').$$

Now we show that this heuristic function is also monotone.

Consider a rule $A_1, \ldots, A_n \rightarrow_v C$ in $R$ and let $(A_i = w_i)$ be weight assignments derivable
using $R$. In this case there is a rule $\mathit{abs}(A_1), \ldots, \mathit{abs}(A_n) \rightarrow_{v'} \mathit{abs}(C)$ in $R'$ where $v' \le v$
and $(\mathit{abs}(A_i) = w'_i)$ is derivable using $R'$ where $w'_i \le w_i$. By definition of contexts (in the
abstract problem) we have,

$$\ell(\mathit{context}(\mathit{abs}(A_i)), R') \le v' + \sum_{j \ne i} w'_j + \ell(\mathit{context}(\mathit{abs}(C)), R').$$

Since $v' \le v$ and $w'_j \le w_j$ we have,

$$\ell(\mathit{context}(\mathit{abs}(A_i)), R') \le v + \sum_{j \ne i} w_j + \ell(\mathit{context}(\mathit{abs}(C)), R').$$

Plugging in the heuristic function $h$ from above and adding $w_i$ to both sides,

$$w_i + h(A_i) \le v + \sum_{j} w_j + h(C),$$

which is exactly the monotonicity condition in equation (1) for an additive rule.

If the abstract problem defined by $(\Sigma', R')$ is relatively small we can efficiently compute
lightest context weights for every statement in $\Sigma'$ using dynamic programming or KLD.
We can store these weights in a "pattern database" (a lookup table) to serve as a heuristic
function for solving the concrete problem using A*LD. This heuristic may be able to stop
A*LD from exploring a lot of non-promising structures. This is exactly the approach that
was used by Culberson and Schaeffer (1998) and Korf (1997) for solving very large search
problems. The results in this section show that pattern databases can be used in the more
general setting of lightest derivation problems. The experiments in Section 10 demonstrate
the technique in a specific application.
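A pattern database for a small abstract problem can be sketched in a few lines. The code below is an illustration under stated assumptions, not the setup used in the experiments: rules are ground and additive, encoded as `(antecedents, v, conclusion)`; lightest derivation weights are computed first, after which the context rules reduce to a shortest path computation from the abstract goal. The abstract problem, the map `abs_map` and all function names are hypothetical, and the rule scan is linear for brevity rather than indexed.

```python
import heapq

def lightest_weights(rules):
    """KLD for additive ground rules (antecedents, v, conclusion)."""
    S, Q = {}, []
    for ants, v, concl in rules:
        if not ants:
            heapq.heappush(Q, (v, concl))
    while Q:
        w, stmt = heapq.heappop(Q)
        if stmt in S:
            continue
        S[stmt] = w
        for ants, v, concl in rules:
            if stmt in ants and all(a in S for a in ants):
                heapq.heappush(Q, (v + sum(S[a] for a in ants), concl))
    return S

def context_weights(rules, goal):
    """Lightest context weight of every statement that has a context."""
    ell = lightest_weights(rules)
    ctx, Q = {}, [(0, goal)]            # the goal has an empty context
    while Q:
        w, stmt = heapq.heappop(Q)
        if stmt in ctx:
            continue
        ctx[stmt] = w
        for ants, v, concl in rules:
            if concl != stmt or any(a not in ell for a in ants):
                continue
            total = w + v + sum(ell[a] for a in ants)
            for a in ants:
                # context(concl) and the other antecedents give context(a)
                heapq.heappush(Q, (total - ell[a], a))
    return ctx

def pattern_database(abstract_rules, abstract_goal, abs_map):
    """Heuristic h(B) = lightest context weight of abs(B)."""
    ctx = context_weights(abstract_rules, abstract_goal)
    return lambda stmt: ctx.get(abs_map[stmt], float("inf"))

abstract_rules = [
    ((), 1, "X"), ((), 1, "Y"),
    (("X", "Y"), 1, "goal1"), (("X", "Y"), 5, "Z"), (("Z",), 1, "goal1"),
]
abs_map = {"X1": "X", "Y1": "Y", "Z1": "Z", "goal0": "goal1"}
ell = lightest_weights(abstract_rules)
ctx = context_weights(abstract_rules, "goal1")
h = pattern_database(abstract_rules, "goal1", abs_map)
```

The resulting function `h` can be passed to an A*LD implementation for the concrete problem. Statements whose abstraction has no context receive an infinite heuristic value and can be ignored, as noted in Section 4.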

6. Hierarchical A* Lightest Derivation

The main disadvantage of using pattern databases is that we have to precompute context 
weights for every abstract statement. This can often take a lot of time and space. Here we 
define a hierarchical algorithm, HA*LD, that searches for lightest derivations and contexts 
in an entire abstraction hierarchy simultaneously. This algorithm can often solve the most 
concrete problem without fully computing context weights at any level of abstraction. 

At each level of abstraction the behavior of HA*LD is similar to the behavior of A*LD
when using an abstraction-derived heuristic function. The hierarchical algorithm queues
derivations of a statement $C$ at a priority that depends on a lightest abstract context for
$C$. But now abstract contexts are not computed in advance. Instead, abstract contexts are
computed at the same time we are computing derivations. Until we have an abstract context
for $C$, derivations of $C$ are "stalled". This is captured by the addition of $\mathit{context}(\mathit{abs}(C))$
as an antecedent to each rule that derives $C$.

We define an abstraction hierarchy with $m$ levels to be a sequence of lightest derivation
problems with additive rules $(\Sigma_k, R_k)$ for $0 \le k \le m-1$ with a single abstraction
function $\mathit{abs}$. For $0 \le k < m-1$ the abstraction function maps $\Sigma_k$ onto $\Sigma_{k+1}$. We require
that $(\Sigma_{k+1}, R_{k+1})$ be an abstraction of $(\Sigma_k, R_k)$ as defined in the previous section:
if $A_1, \ldots, A_n \rightarrow_v C$ is in $R_k$ then there exists a rule $\mathit{abs}(A_1), \ldots, \mathit{abs}(A_n) \rightarrow_{v'} \mathit{abs}(C)$ in
$R_{k+1}$ with $v' \le v$. The hierarchical algorithm computes lightest derivations of statements
in $\Sigma_k$ using contexts from $\Sigma_{k+1}$ to define heuristic values. We extend $\mathit{abs}$ so that it maps
$\Sigma_{m-1}$ to a most abstract set of statements $\Sigma_m$ containing a single element $\bot$. Since $\mathit{abs}$ is
onto we have $|\Sigma_k| \ge |\Sigma_{k+1}|$. That is, the number of statements decreases as we go up the
abstraction hierarchy. We denote by $\mathit{abs}_k$ the abstraction function from $\Sigma_0$ to $\Sigma_k$ obtained
by composing $\mathit{abs}$ with itself $k$ times.

We are interested in computing a lightest derivation of a goal statement $\mathit{goal} \in \Sigma_0$. Let
$\mathit{goal}_k = \mathit{abs}_k(\mathit{goal})$ be the goal at each level of abstraction. The hierarchical algorithm is
defined by the set of prioritized rules $\mathcal{H}$ in Figure 7. Rules labeled UP compute derivations
of statements at one level of abstraction using context weights from the level above to define
priorities. Rules labeled BASE and DOWN compute contexts in one level of abstraction
using derivation weights at the same level to define priorities. The rules labeled START1
and START2 start the inference by handling the most abstract level.

The execution of $\mathcal{H}$ starts by computing a derivation and context for $\bot$ with START1
and START2. It continues by deriving statements in $\Sigma_{m-1}$ using UP rules. Once the
lightest derivation of $\mathit{goal}_{m-1}$ is found the algorithm derives a context for $\mathit{goal}_{m-1}$ with a
BASE rule and starts computing contexts for other statements in $\Sigma_{m-1}$ using DOWN rules.
In general HA*LD interleaves the computation of derivations and contexts at each level of
abstraction since the execution of $\mathcal{H}$ uses a single priority queue.

START1: no antecedents; priority $0$; conclusion $\bot = 0$.

START2: no antecedents; priority $0$; conclusion $\mathit{context}(\bot) = 0$.

BASE: antecedent $\mathit{goal}_k = w$; priority $w$; conclusion $\mathit{context}(\mathit{goal}_k) = 0$.

UP: antecedents $\mathit{context}(\mathit{abs}(C)) = w_c$, $A_1 = w_1, \ldots, A_n = w_n$;
priority $v + w_1 + \cdots + w_n + w_c$; conclusion $C = v + w_1 + \cdots + w_n$.

DOWN: antecedents $\mathit{context}(C) = w_c$, $A_1 = w_1, \ldots, A_n = w_n$;
priority $v + w_c + w_1 + \cdots + w_n$; conclusion $\mathit{context}(A_i) = v + w_c + w_1 + \cdots + w_n - w_i$.

Figure 7: Prioritized rules $\mathcal{H}$ defining HA*LD. BASE rules are defined for $0 \le k \le m-1$.
UP and DOWN rules are defined for each rule $A_1, \ldots, A_n \rightarrow_v C \in R_k$ with
$0 \le k \le m-1$.

Note that no computation happens at a given level of abstraction until a lightest derivation
of the goal has been found at the level above. This means that the structure of the
abstraction hierarchy can be defined dynamically. For example, as in the CFDP algorithm,
we could define the set of statements at each level of abstraction by refining the statements
that appear in a lightest derivation of the goal at the level above. Here we assume a static
abstraction hierarchy.

For each statement $C \in \Sigma_k$ with $0 \le k \le m-1$ we use $\ell(C)$ to denote the weight of
a lightest derivation for $C$ using $R_k$ while $\ell(\mathit{context}(C))$ denotes the weight of a lightest
context for $C$ using $R_k$. For the most abstract level we define $\ell(\bot) = \ell(\mathit{context}(\bot)) = 0$.

Below we show that HA*LD correctly computes lightest derivations and lightest contexts
at every level of abstraction. Moreover, the order in which derivations and contexts are
expanded is controlled by a heuristic function defined as follows. For $C \in \Sigma_k$ with $0 \le k \le
m-1$ define a heuristic value for $C$ using contexts at the level above and a heuristic value
for $\mathit{context}(C)$ using derivations at the same level,

$$h(C) = \ell(\mathit{context}(\mathit{abs}(C))),$$
$$h(\mathit{context}(C)) = \ell(C).$$

For the most abstract level we define $h(\bot) = h(\mathit{context}(\bot)) = 0$. Let a generalized statement
$\Phi$ be either an element of $\Sigma_k$ for $0 \le k \le m$ or an expression of the form $\mathit{context}(C)$ for
$C \in \Sigma_k$. We define an intrinsic priority for $\Phi$ as follows,

$$p(\Phi) = \ell(\Phi) + h(\Phi).$$

For $C \in \Sigma_k$, we have that $p(\mathit{context}(C))$ is the weight of a lightest derivation of $\mathit{goal}_k$ that
uses $C$, while $p(C)$ is a lower bound on this weight.

The results from Sections 4 and 5 cannot be used directly to show the correctness of
HA*LD. This is because the rules in Figure 7 generate heuristic values at the same time
they generate derivations that depend on those heuristic values. Intuitively we must show
that during the execution of the prioritized rules $\mathcal{H}$, each heuristic value is available at an
appropriate point in time. The next lemma shows that the rules in $\mathcal{H}$ satisfy a monotonicity
property with respect to the intrinsic priority of generalized statements. Theorem 5 proves
the correctness of the hierarchical algorithm.

Lemma 4 (Monotonicity). For each rule $\Phi_1, \ldots, \Phi_m \rightarrow \Psi$ in the hierarchical algorithm, if
the weight of each antecedent $\Phi_i$ is $\ell(\Phi_i)$ and the weight of the conclusion is $\alpha$ then

(a) the priority of the rule is $\alpha + h(\Psi)$.

(b) $\alpha + h(\Psi) \ge p(\Phi_i)$.

Proof. For the rules START1 and START2 the result follows from the fact that the rules
have no antecedents and $h(\bot) = h(\mathit{context}(\bot)) = 0$.

Consider a rule labeled BASE with $w = \ell(\mathit{goal}_k)$. To see (a) note that $\alpha$ is always zero
and the priority of the rule is $w = h(\mathit{context}(\mathit{goal}_k))$. For (b) we note that $p(\mathit{goal}_k) =
\ell(\mathit{goal}_k)$ which equals the priority of the rule.

Now consider a rule labeled UP with $w_c = \ell(\mathit{context}(\mathit{abs}(C)))$ and $w_i = \ell(A_i)$ for all $i$.
For part (a) note how the priority of the rule is $\alpha + w_c$ and $h(C) = w_c$. For part (b) consider
the first antecedent of the rule. We have $h(\mathit{context}(\mathit{abs}(C))) = \ell(\mathit{abs}(C)) \le \ell(C) \le \alpha$, and
$p(\mathit{context}(\mathit{abs}(C))) = w_c + h(\mathit{context}(\mathit{abs}(C))) \le \alpha + w_c$. Now consider an antecedent
$A_i$. If $\mathit{abs}(A_i) = \bot$ then $p(A_i) = w_i \le \alpha + w_c$. If $\mathit{abs}(A_i) \ne \bot$ then we can show that
$h(A_i) = \ell(\mathit{context}(\mathit{abs}(A_i))) \le w_c + \alpha - w_i$. This implies that $p(A_i) = w_i + h(A_i) \le \alpha + w_c$.

Finally consider a rule labeled DOWN with $w_c = \ell(\mathit{context}(C))$ and $w_j = \ell(A_j)$ for all
$j$. For part (a) note that the priority of the rule is $\alpha + w_i$ and $h(\mathit{context}(A_i)) = w_i$. For part
(b) consider the first antecedent of the rule. We have $h(\mathit{context}(C)) = \ell(C) \le v + \sum_j w_j$
and we see that $p(\mathit{context}(C)) = w_c + h(\mathit{context}(C)) \le \alpha + w_i$. Now consider an antecedent $A_j$. If
$\mathit{abs}(A_j) = \bot$ then $h(A_j) = 0$ and $p(A_j) = w_j \le \alpha + w_i$. If $\mathit{abs}(A_j) \ne \bot$ we can show that
$h(A_j) \le \alpha + w_i - w_j$. Hence $p(A_j) = w_j + h(A_j) \le \alpha + w_i$. □

Theorem 5. The execution of $\mathcal{H}$ maintains the following invariants.

1. If $(\Phi = w) \in \mathcal{S}$ then $w = \ell(\Phi)$.

2. If $(\Phi = w) \in \mathcal{Q}$ then it has priority $w + h(\Phi)$.

3. If $p(\Phi) < p(\mathcal{Q})$ then $(\Phi = \ell(\Phi)) \in \mathcal{S}$.

Here $p(\mathcal{Q})$ denotes the smallest priority in $\mathcal{Q}$.

Proof. In the initial state of the algorithm $\mathcal{S}$ is empty and $\mathcal{Q}$ contains only $(\bot = 0)$ and
$(\mathit{context}(\bot) = 0)$ at priority 0. For the initial state invariant 1 is true since $\mathcal{S}$ is empty;
invariant 2 follows from the definition of $h(\bot)$ and $h(\mathit{context}(\bot))$; and invariant 3 follows
from the fact that $p(\mathcal{Q}) = 0$ and $p(\Phi) \ge 0$ for all $\Phi$. Let $\mathcal{S}$ and $\mathcal{Q}$ denote the state of the
algorithm immediately prior to an iteration of the while loop in Figure 5 and suppose the
invariants are true. Let $\mathcal{S}'$ and $\mathcal{Q}'$ denote the state of the algorithm after the iteration.

We will first prove invariant 1 for $\mathcal{S}'$. Let $(\Phi = w)$ be the element removed from $\mathcal{Q}$ in
this iteration. By the soundness of the rules we have $w \ge \ell(\Phi)$. If $w = \ell(\Phi)$ then clearly
invariant 1 holds for $\mathcal{S}'$. If $w > \ell(\Phi)$ invariant 2 implies that $p(\mathcal{Q}) = w + h(\Phi) > \ell(\Phi) + h(\Phi)$
and by invariant 3 we know that $\mathcal{S}$ contains $(\Phi = \ell(\Phi))$. In this case $\mathcal{S}' = \mathcal{S}$.

Invariant 2 for $\mathcal{Q}'$ follows from invariant 2 for $\mathcal{Q}$, invariant 1 for $\mathcal{S}'$, and part (a) of the
monotonicity lemma.

Finally, we consider invariant 3 for $\mathcal{S}'$ and $\mathcal{Q}'$. The proof is by reverse induction on the
abstraction level of $\Phi$. We say that $\Phi$ has level $k$ if $\Phi \in \Sigma_k$ or $\Phi$ is of the form $\mathit{context}(C)$
with $C \in \Sigma_k$. In the reverse induction, the base case considers $\Phi$ at level $m$. Initially the
algorithm inserts $(\bot = 0)$ and $(\mathit{context}(\bot) = 0)$ in the queue with priority 0. If $p(\mathcal{Q}') > 0$
then $\mathcal{S}'$ must contain $(\bot = 0)$ and $(\mathit{context}(\bot) = 0)$. Hence invariant 3 holds for $\mathcal{S}'$ and $\mathcal{Q}'$
with $\Phi$ at level $m$.

Now we assume that invariant 3 holds for $\mathcal{S}'$ and $\mathcal{Q}'$ with $\Phi$ at levels greater than $k$
and consider level $k$. We first consider statements $C \in \Sigma_k$. Since the rules $R_k$ are additive,
every statement $C$ derivable with $R_k$ has a lightest derivation (a derivation with weight
$\ell(C)$). This follows from the correctness of Knuth's algorithm. Moreover, for additive
rules, subtrees of lightest derivations are also lightest derivations. We show by structural
induction that for any lightest derivation with conclusion $C$ such that $p(C) < p(\mathcal{Q}')$ we
have $(C = \ell(C)) \in \mathcal{S}'$. Consider a lightest derivation in $R_k$ with conclusion $C$ such that
$p(C) < p(\mathcal{Q}')$. The final rule in this derivation $A_1, \ldots, A_n \rightarrow_v C$ corresponds to an UP rule
where we add an antecedent for $\mathit{context}(\mathit{abs}(C))$. By part (b) of the monotonicity lemma
all the antecedents of this UP rule have intrinsic priority less than $p(\mathcal{Q}')$. By the induction
hypothesis on lightest derivations we have $(A_i = \ell(A_i)) \in \mathcal{S}'$. Since invariant 3 holds for
statements at levels greater than $k$ we have $(\mathit{context}(\mathit{abs}(C)) = \ell(\mathit{context}(\mathit{abs}(C)))) \in \mathcal{S}'$.
This implies that at some point the UP rule was used to derive $(C = \ell(C))$ at priority $p(C)$.
But $p(C) < p(\mathcal{Q}')$ and hence this item must have been removed from the queue. Therefore
$\mathcal{S}'$ must contain $(C = w)$ for some $w$ and, by invariant 1, $w = \ell(C)$.

Now we consider $\Phi$ of the form $\mathit{context}(C)$ with $C \in \Sigma_k$. As before we see that $c(R_k)$ is
additive and thus every statement derivable with $c(R_k)$ has a lightest derivation and subtrees
of lightest derivations are lightest derivations themselves. We prove by structural induction
that for any lightest derivation $T$ with conclusion $\mathit{context}(C)$ such that $p(\mathit{context}(C)) <
p(\mathcal{Q}')$ we have $(\mathit{context}(C) = \ell(\mathit{context}(C))) \in \mathcal{S}'$. Suppose the last rule of $T$ is of the form,

$$\mathit{context}(C), A_1, \ldots, A_{i-1}, A_{i+1}, \ldots, A_n \rightarrow_v \mathit{context}(A_i).$$

This rule corresponds to a DOWN rule where we add an antecedent for $A_i$. By part (b) of
the monotonicity lemma all the antecedents of this DOWN rule have intrinsic priority less
than $p(\mathcal{Q}')$. By invariant 3 for statements in $\Sigma_k$ and by the induction hypothesis on lightest
derivations using $c(R_k)$, all antecedents of the DOWN rule have their lightest weight in
$\mathcal{S}'$. So at some point $(\mathit{context}(A_i) = \ell(\mathit{context}(A_i)))$ was derived at priority $p(\mathit{context}(A_i))$. Now
$p(\mathit{context}(A_i)) < p(\mathcal{Q}')$ implies the item was removed from the queue and, by invariant 1, we have
$(\mathit{context}(A_i) = \ell(\mathit{context}(A_i))) \in \mathcal{S}'$.

Now suppose the last (and only) rule in $T$ is $\rightarrow_0 \mathit{context}(\mathit{goal}_k)$. This rule corresponds
to a BASE rule where we add $\mathit{goal}_k$ as an antecedent. Note that $p(\mathit{goal}_k) =
\ell(\mathit{goal}_k) = p(\mathit{context}(\mathit{goal}_k))$ and hence $p(\mathit{goal}_k) < p(\mathcal{Q}')$. By invariant 3 for statements
in $\Sigma_k$ we have $(\mathit{goal}_k = \ell(\mathit{goal}_k))$ in $\mathcal{S}'$ and at some point the BASE rule was used to queue
$(\mathit{context}(\mathit{goal}_k) = \ell(\mathit{context}(\mathit{goal}_k)))$ at priority $p(\mathit{context}(\mathit{goal}_k))$. As in the previous cases
$p(\mathit{context}(\mathit{goal}_k)) < p(\mathcal{Q}')$ implies $(\mathit{context}(\mathit{goal}_k) = \ell(\mathit{context}(\mathit{goal}_k))) \in \mathcal{S}'$. □

The last theorem implies that generalized statements $\Phi$ are expanded in order of their
intrinsic priority. Let $K$ be the number of statements $C$ in the entire abstraction hierarchy
with $p(C) \le p(\mathit{goal}) = \ell(\mathit{goal})$. For every statement $C$ we have that $p(C) \le p(\mathit{context}(C))$.
We conclude that HA*LD expands at most $2K$ generalized statements before computing a
lightest derivation of goal.



6.1 Example 

Now we consider the execution of HA*LD in a specific example. The example illustrates 
how HA*LD interleaves the computation of structures at different levels of abstraction. 
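Consider the following abstraction hierarchy with 2 levels.

$$\Sigma_0 = \{X_1, \ldots, X_n, Y_1, \ldots, Y_n, Z_1, \ldots, Z_n, \mathit{goal}_0\}, \qquad \Sigma_1 = \{X, Y, Z, \mathit{goal}_1\},$$

$$R_0 = \{\ \rightarrow_i X_i,\ \ \rightarrow_i Y_i,\ \ X_i, Y_j \rightarrow_{\ldots} \mathit{goal}_0,\ \ X_i, Y_i \rightarrow_{\ldots} Z_i,\ \ Z_i \rightarrow_{\ldots} \mathit{goal}_0\ \},$$

$$R_1 = \{\ \rightarrow_1 X,\ \ \rightarrow_1 Y,\ \ X, Y \rightarrow_1 \mathit{goal}_1,\ \ X, Y \rightarrow_5 Z,\ \ Z \rightarrow_1 \mathit{goal}_1\ \},$$

with $\mathit{abs}(X_i) = X$, $\mathit{abs}(Y_i) = Y$, $\mathit{abs}(Z_i) = Z$ and $\mathit{abs}(\mathit{goal}_0) = \mathit{goal}_1$.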

1. Initially $\mathcal{S} = \emptyset$ and $\mathcal{Q} = \{(\bot = 0)$ and $(\mathit{context}(\bot) = 0)$ at priority $0\}$.

2. When $(\bot = 0)$ comes off the queue it gets put in $\mathcal{S}$ but nothing else happens.

3. When $(\mathit{context}(\bot) = 0)$ comes off the queue it gets put in $\mathcal{S}$. Now statements in $\Sigma_1$
have an abstract context in $\mathcal{S}$. This causes UP rules that come from rules in $R_1$ with
no antecedents to "fire", putting $(X = 1)$ and $(Y = 1)$ in $\mathcal{Q}$ at priority 1.

4. When $(X = 1)$ and $(Y = 1)$ come off the queue they get put in $\mathcal{S}$, causing two UP
rules to fire, putting $(\mathit{goal}_1 = 3)$ at priority 3 and $(Z = 7)$ at priority 7 in the queue.

5. We have,

$\mathcal{S} = \{(\bot = 0), (\mathit{context}(\bot) = 0), (X = 1), (Y = 1)\}$
$\mathcal{Q} = \{(\mathit{goal}_1 = 3)$ at priority 3, $(Z = 7)$ at priority $7\}$

6. At this point $(\mathit{goal}_1 = 3)$ comes off the queue and goes into $\mathcal{S}$. A BASE rule fires
putting $(\mathit{context}(\mathit{goal}_1) = 0)$ in the queue at priority 3.

7. $(\mathit{context}(\mathit{goal}_1) = 0)$ comes off the queue. This is the base case for contexts in $\Sigma_1$. Two
DOWN rules use $(\mathit{context}(\mathit{goal}_1) = 0)$, $(X = 1)$ and $(Y = 1)$ to put $(\mathit{context}(X) = 2)$
and $(\mathit{context}(Y) = 2)$ in $\mathcal{Q}$ at priority 3.

8. $(\mathit{context}(X) = 2)$ comes off the queue and gets put in $\mathcal{S}$. Now we have an abstract
context for each $X_i \in \Sigma_0$, so UP rules put $(X_i = i)$ in $\mathcal{Q}$ at priority $i + 2$.

9. Now $(\mathit{context}(Y) = 2)$ comes off the queue and goes into $\mathcal{S}$. As in the previous step
UP rules put $(Y_i = i)$ in $\mathcal{Q}$ at priority $i + 2$.

10. We have,

$\mathcal{S} = \{(\bot = 0), (\mathit{context}(\bot) = 0), (X = 1), (Y = 1), (\mathit{goal}_1 = 3),$
$(\mathit{context}(\mathit{goal}_1) = 0), (\mathit{context}(X) = 2), (\mathit{context}(Y) = 2)\}$

$\mathcal{Q} = \{(X_i = i)$ and $(Y_i = i)$ at priority $i + 2$ for $1 \le i \le n$, $(Z = 7)$ at priority $7\}$

11. Next both $(X_1 = 1)$ and $(Y_1 = 1)$ will come off the queue and go into $\mathcal{S}$. This causes
an UP rule to put $(\mathit{goal}_0 = 3)$ in the queue at priority 3.

12. $(\mathit{goal}_0 = 3)$ comes off the queue and goes into $\mathcal{S}$. The algorithm can stop now since
we have a derivation of the most concrete goal.

Note how HA*LD terminates before fully computing abstract derivations and contexts.
In particular $(Z = 7)$ is in $\mathcal{Q}$ but $Z$ was never expanded. Moreover $\mathit{context}(Z)$ is not even
in the queue. If we keep running the algorithm it would eventually derive $\mathit{context}(Z)$, and
that would allow the $Z_i$ to be derived.
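The walk-through above can be reproduced with a compact sketch of the START, BASE, UP and DOWN rules for ground additive problems. Everything named here is an assumption made for illustration: the encoding of a derivation statement as `("d", C)` and a context statement as `("c", C)`, the function names, the choice $n = 2$, and in particular the concrete rule weights in `R0`, which are hypothetical values chosen only to agree with the weights reported in the walk-through and with the abstraction condition $v' \le v$.

```python
import heapq
from collections import defaultdict

BOT = "bottom"

def hald_rules(levels, goals, abs_map):
    """Prioritized rules (antecedents, conclusion, fn); fn(ws) -> (weight, priority)."""
    rules = [((), ("d", BOT), lambda ws: (0, 0)),            # START1
             ((), ("c", BOT), lambda ws: (0, 0))]            # START2
    for R_k, goal_k in zip(levels, goals):
        rules.append(((("d", goal_k),), ("c", goal_k),       # BASE
                      lambda ws: (0, ws[0])))
        for ants, v, C in R_k:
            d_ants = tuple(("d", a) for a in ants)
            rules.append(((("c", abs_map[C]),) + d_ants, ("d", C),      # UP
                          lambda ws, v=v: (v + sum(ws[1:]), v + sum(ws))))
            for i, a_i in enumerate(ants):                              # DOWN
                rules.append(((("c", C),) + d_ants, ("c", a_i),
                              lambda ws, v=v, i=i: (v + sum(ws) - ws[i + 1],
                                                    v + sum(ws))))
    return rules

def execute(rules, target):
    uses = defaultdict(list)
    for idx, (ants, _, _) in enumerate(rules):
        for a in set(ants):
            uses[a].append(idx)
    S, Q = {}, []

    def fire(idx):
        ants, concl, fn = rules[idx]
        w, priority = fn([S[a] for a in ants])
        heapq.heappush(Q, (priority, w, concl))

    for idx, (ants, _, _) in enumerate(rules):
        if not ants:
            fire(idx)
    while Q:
        _, w, stmt = heapq.heappop(Q)
        if stmt in S:
            continue
        S[stmt] = w
        if stmt == target:
            break
        for idx in uses[stmt]:
            if all(a in S for a in rules[idx][0]):
                fire(idx)
    return S

n = 2
R1 = [((), 1, "X"), ((), 1, "Y"), (("X", "Y"), 1, "goal1"),
      (("X", "Y"), 5, "Z"), (("Z",), 1, "goal1")]
R0 = []                                   # hypothetical concrete weights
for i in range(1, n + 1):
    R0 += [((), i, f"X{i}"), ((), i, f"Y{i}"),
           ((f"X{i}", f"Y{i}"), 5, f"Z{i}"), ((f"Z{i}",), i, "goal0")]
    R0 += [((f"X{i}", f"Y{j}"), i * j, "goal0") for j in range(1, n + 1)]
abs_map = {"goal0": "goal1", "X": BOT, "Y": BOT, "Z": BOT, "goal1": BOT}
for i in range(1, n + 1):
    abs_map.update({f"X{i}": "X", f"Y{i}": "Y", f"Z{i}": "Z"})

S = execute(hald_rules([R0, R1], ["goal0", "goal1"], abs_map), ("d", "goal0"))
```

The run stops with $\mathit{goal}_0 = 3$ after expanding eleven generalized statements, and neither $Z$ nor $\mathit{context}(Z)$ is ever placed in $\mathcal{S}$. The order among statements of equal priority depends on how the heap breaks ties, so it may differ slightly from the order listed in the walk-through.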

7. The Perception Pipeline 

Figure 8 shows a hypothetical run of the hierarchical algorithm for a processing pipeline
of a vision system. In this system weighted statements about edges are used to derive
weighted statements about contours which provide input to later stages ultimately resulting
in statements about recognized objects.

It is well known that the subjective presence of edges at a particular image location can
depend on the context in which a given image patch appears. This can be interpreted in
the perception pipeline by stating that higher level processes (those later in the pipeline)
influence low-level interpretations. This kind of influence happens naturally in a lightest
derivation problem. For example, the lightest derivation of a complete scene analysis might
require the presence of an edge that is not locally apparent. By implementing the whole
system as a single lightest derivation problem we avoid the need to make hard decisions
between stages of the pipeline.

Figure 8: A vision system with several levels of processing. Forward arrows represent the
normal flow of information from one stage of processing to the next. Backward
arrows represent the computation of contexts. Downward arrows represent the
influence of contexts.

The influence of late pipeline stages in guiding earlier stages is pronounced if we use 
HA*LD to compute lightest derivations. In this case the influence is apparent not only 
in the structure of the optimal solution but also in the flow of information across different 
stages of processing. In HA*LD a complete interpretation derived at one level of abstraction 
guides all processing stages at a more concrete level. Structures derived at late stages of 
the pipeline guide earlier stages through abstract context weights. This allows the early 
processing stages to concentrate computational efforts in constructing structures that will 
likely be part of the globally optimal solution. 

While we have emphasized the use of admissible heuristics, we note that the A* archi- 
tecture, including HA*LD, can also be used with inadmissible heuristic functions (of course 
this would break our optimality guarantees). Inadmissible heuristics are important because 
admissible heuristics tend to force the first few stages of a processing pipeline to generate 
too many derivations. As derivations are composed their weights increase and this causes a 
large number of derivations to be generated at the first few stages of processing before the 
first derivation reaches the end of the pipeline. Inadmissible heuristics can produce behavior 
similar to beam search — derivations generated in the first stage of the pipeline can flow 
through the whole pipeline quickly. A natural way to construct inadmissible heuristics is to 
simply "scale-up" an admissible heuristic such as the ones obtained from abstractions. It is 
then possible to construct a hierarchical algorithm where inadmissible heuristics obtained 
from one level of abstraction are used to guide search at the level below. 

8. Other Hierarchical Methods 

In this section we compare HA*LD to other hierarchical search methods.

8.1 Coarse-to-Fine Dynamic Programming

HA*LD is related to the coarse-to-fine dynamic programming (CFDP) method described by
Raphael (2001). To understand the relationship consider the problem of finding the shortest
path from $s$ to $t$ in a trellis graph like the one shown in Figure 9(a). Here we have $k$ columns
of $n$ nodes and every node in one column is connected to a constant number of nodes in
the next column. Standard dynamic programming can be used to find the shortest path
in $O(kn)$ time. Both CFDP and HA*LD can often find the shortest path much faster. On
the other hand the worst case behavior of these algorithms is very different as we describe
below, with CFDP taking significantly more time than HA*LD.

The CFDP algorithm works by coarsening the graph, grouping nodes in each column
into a small number of supernodes as illustrated in Figure 9(b). The weight of an edge
between two supernodes $A$ and $B$ is the minimum weight between nodes $a \in A$ and $b \in B$.
The algorithm starts by using dynamic programming to find the shortest path $P$ from $s$
to $t$ in the coarse graph; this is shown in bold in Figure 9(b). The supernodes along $P$
are partitioned to define a finer graph as shown in Figure 9(c) and the procedure is repeated.
Eventually the shortest path $P$ will only go through supernodes of size one, corresponding
to a path in the original graph. At this point we know that $P$ must be a shortest path from
$s$ to $t$ in the original graph. In the best case the optimal path in each iteration will be a
refinement of the optimal path from the previous iteration. This would result in $O(\log n)$
shortest path computations, each in fairly coarse graphs. On the other hand, in the worst
case CFDP will take $O(n)$ iterations to refine the whole graph, and many of the iterations
will involve finding shortest paths in large graphs. In this case CFDP takes $\Omega(kn^2)$ time
which is much worse than the standard dynamic programming approach.

Now suppose we use HA*LD to find the shortest path from s to t in a graph like the
one in Figure 9(a). We can build an abstraction hierarchy with $O(\log n)$ levels where each
supernode at level i contains $2^i$ nodes from one column of the original graph. The coarse
graph in Figure 9(b) represents the highest level of this abstraction hierarchy. Note that
HA*LD will consider a small number, $O(\log n)$, of predefined graphs while CFDP can end
up considering a much larger number, $\Omega(n)$, of graphs. In the best case scenario HA*LD
will expand only the nodes that are in the shortest path from s to t at each level of the
hierarchy. In the worst case HA*LD will compute a lightest path and context for every
node in the hierarchy (here a context for a node v is a path from v to t). At the i-th
abstraction level we have a graph with $O(kn/2^i)$ nodes and edges. HA*LD will spend at
most $O(kn \log(kn)/2^i)$ time computing paths and contexts at level i. Summing over levels
we get at most $O(kn \log(kn))$ time total, which is not much worse than the $O(kn)$ time
taken by the standard dynamic programming approach.
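
The total follows from a geometric series. As a worked step, assuming the per-level bound holds with a shared constant c (our notation) and levels $i = 0, \ldots, \log_2 n$:

```latex
\sum_{i=0}^{\log_2 n} c\,\frac{kn\log(kn)}{2^i}
  \;\le\; c\,kn\log(kn)\sum_{i=0}^{\infty} 2^{-i}
  \;=\; 2c\,kn\log(kn)
  \;=\; O(kn\log(kn)).
```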

8.2 Hierarchical Heuristic Search 

Our hierarchical method is also related to the HA* and HIDA* algorithms described by
Holte et al. (1996) and Holte et al. (2005). These methods are restricted to shortest path
problems but they also use a hierarchy of abstractions. A heuristic function is defined for
each level of abstraction using shortest paths to the goal at the level above. The main
idea is to run A* or IDA* to compute a shortest path while computing heuristic values
on demand. Let abs map a node to its abstraction and let g be the goal node in the concrete
graph. Whenever the heuristic value for a concrete node v is needed we call the algorithm
recursively to find the shortest path from abs(v) to abs(g). This recursive call uses heuristic
values defined from a further abstraction, computed through deeper recursive calls.
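
The recursive, on-demand computation of heuristic values can be sketched as follows. This is a simplified illustration and not the HA* implementation of Holte et al.; the function name `make_hierarchical_search`, the graph encoding, and the toy graphs are assumptions made for this example. It also assumes that every concrete edge maps either to a single abstract node or to an abstract edge of no greater cost, so that abstract distances are admissible.

```python
import heapq
import math

def make_hierarchical_search(graphs, abs_maps):
    """graphs[l]: dict node -> list of (successor, cost) at level l (0 is concrete).
    abs_maps[l]: dict mapping each level-l node to its abstraction at level l + 1.
    Returns dist(level, start, goal), the shortest path cost at that level.
    Nodes within one level must be mutually comparable (used to break heap ties).
    """
    top = len(graphs) - 1
    cache = {}  # memoized distances, reused across heuristic queries

    def dist(level, start, goal):
        key = (level, start, goal)
        if key in cache:
            return cache[key]

        def h(v):
            # Heuristic computed on demand by a recursive search one level up.
            if level == top:
                return 0.0
            return dist(level + 1, abs_maps[level][v], abs_maps[level][goal])

        best = {start: 0.0}
        frontier = [(h(start), 0.0, start)]
        result = math.inf
        while frontier:
            _, g, v = heapq.heappop(frontier)
            if g > best[v]:
                continue  # stale queue entry
            if v == goal:
                result = g
                break
            for u, cost in graphs[level].get(v, []):
                if g + cost < best.get(u, math.inf):
                    best[u] = g + cost
                    heapq.heappush(frontier, (g + cost + h(u), g + cost, u))
        cache[key] = result
        return result

    return dist

# Hypothetical two-level example: concrete nodes 0..3, abstract nodes "A" and "B".
concrete = {0: [(1, 1.0), (2, 5.0)], 1: [(2, 2.0)], 2: [(3, 1.0)]}
abstract = {"A": [("B", 2.0)]}  # minimum cost of any concrete edge from A to B
to_abstract = {0: "A", 1: "A", 2: "B", 3: "B"}
dist = make_hierarchical_search([concrete, abstract], [to_abstract])
shortest = dist(0, 0, 3)
```

Note that each call to `h` blocks until the abstract search finishes, which is exactly the behavior that can cause the stalling discussed next.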

It is not clear how to generalize HA* and HIDA* to lightest derivation problems that
have rules with multiple antecedents. Another disadvantage is that these methods can
potentially "stall" in the case of directed graphs. For example, suppose that when using
HA* or HIDA* we expand a node with two successors x and y, where x is close to the goal
but y is very far. At this point we need a heuristic value for x and y, and we might have
to spend a long time computing a shortest path from abs(y) to abs(g). On the other hand,
HA*LD would not wait for this shortest path to be fully computed. Intuitively HA*LD
would compute shortest paths from abs(x) and abs(y) to abs(g) simultaneously. As soon
as the shortest path from abs(x) to abs(g) is found we can start exploring the path from x
to g, independent of how long it would take to compute a path from abs(y) to abs(g).

Figure 10: A convex set specified by a hypothesis $(r_0, \ldots, r_7)$.



9. Convex Object Detection 

Now we consider an application of HA*LD to the problem of detecting convex objects in
images. We pose the problem using a formulation similar to the one described by Raphael 
(2001), where the optimal convex object around a point can be found by solving a shortest 
path problem. We compare HA*LD to other search methods, including CFDP and A* 
with pattern databases. The results indicate that HA*LD performs better than the other 
methods over a wide range of inputs. 

Let x be a reference point inside a convex object. We can represent the object boundary
using polar coordinates with respect to a coordinate system centered at x. In this case the
object is described by a periodic function $r(\theta)$ specifying the distance from x to the object
boundary as a function of the angle $\theta$. Here we only specify $r(\theta)$ at a finite number of angles
$(\theta_0, \ldots, \theta_{N-1})$ and assume the boundary is a straight line segment between sample points.
We also assume the object is contained in a ball of radius R around x and that $r(\theta)$ is an
integer. Thus an object is parametrized by $(r_0, \ldots, r_{N-1})$ where $r_i \in [0, R-1]$. An example
with N = 8 angles is shown in Figure 10.

Not every hypothesis $(r_0, \ldots, r_{N-1})$ specifies a convex object. The hypothesis describes
a convex set exactly when the object boundary turns left at each sample point $(\theta_i, r_i)$ as
i increases. Let $C(r_{i-1}, r_i, r_{i+1})$ be a Boolean function indicating when three sequential
values for $r(\theta)$ define a boundary that is locally convex at i. The hypothesis $(r_0, \ldots, r_{N-1})$
is convex when it is locally convex at each i.³
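
One possible way to evaluate the local convexity test is sketched below. The text does not spell out C, so this is an assumption-laden illustration: it supposes N equally spaced angles (so the test is independent of i), treats a straight continuation as convex, and uses the hypothetical names `locally_convex` and `is_convex`.

```python
import math

def locally_convex(r_prev, r_cur, r_next, N, eps=1e-9):
    """C(r_{i-1}, r_i, r_{i+1}) under the assumption of N equally spaced angles.

    With equal spacing the test does not depend on i, so the three samples are
    placed at angles -delta, 0 and +delta. Returns True when the boundary turns
    left (or goes straight, within eps) at the middle point.
    """
    delta = 2.0 * math.pi / N
    ax, ay = r_prev * math.cos(-delta), r_prev * math.sin(-delta)
    bx, by = float(r_cur), 0.0
    cx, cy = r_next * math.cos(delta), r_next * math.sin(delta)
    # z-component of the cross product of the two consecutive boundary segments
    cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
    return cross >= -eps

def is_convex(radii):
    """A hypothesis is convex when it is locally convex at every index (cyclically)."""
    N = len(radii)
    return all(
        locally_convex(radii[i - 1], radii[i], radii[(i + 1) % N], N)
        for i in range(N)
    )
```

For example, a regular octagon such as `[5] * 8` passes the test, while a hypothesis with a dent such as `[5, 5, 1, 5, 5, 5, 5, 5]` fails at the dented sample.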

Throughout this section we assume that the reference point x is fixed in advance. Our 
goal is to find an "optimal" convex object around a given reference point. In practice 
reference locations can be found using a variety of methods such as a Hough transform. 

3. This parametrization of convex objects is similar but not identical to the one used by Raphael (2001).

Let $D(i, r_i, r_{i+1})$ be an image data cost measuring the evidence for a boundary segment
from $(\theta_i, r_i)$ to $(\theta_{i+1}, r_{i+1})$. We consider the problem of finding a convex object for which
the sum of the data costs along the whole boundary is minimal. That is, we look for a
convex hypothesis minimizing the following energy function,

$$E(r_0, \ldots, r_{N-1}) = \sum_{i=0}^{N-1} D(i, r_i, r_{i+1}),$$

where $r_N = r_0$.

The data costs can be precomputed and specified by a lookup table with $O(NR^2)$ entries.
In our experiments we use a data cost based on the integral of the image gradient along
each boundary segment. Another approach would be to use the data term described by
Raphael (2001) where the cost depends on the contrast between the inside and the outside
of the object measured within the pie-slice defined by $\theta_i$ and $\theta_{i+1}$.

An optimal convex object can be found using standard dynamic programming techniques.
Let $B(i, r_0, r_1, r_{i-1}, r_i)$ be the cost of an optimal partial convex object starting at
$r_0$ and $r_1$ and ending at $r_{i-1}$ and $r_i$. Here we keep track of the last two boundary points to
enforce the convexity constraint as we extend partial objects. We also have to keep track
of the first two boundary points to enforce that $r_N = r_0$ and the convexity constraint at $r_0$.
We can compute B using the recursive formula,

$$B(1, r_0, r_1, r_0, r_1) = D(0, r_0, r_1),$$
$$B(i+1, r_0, r_1, r_i, r_{i+1}) = \min_{r_{i-1}} B(i, r_0, r_1, r_{i-1}, r_i) + D(i, r_i, r_{i+1}),$$

where the minimization is over choices for $r_{i-1}$ such that $C(r_{i-1}, r_i, r_{i+1}) = \mathit{true}$. The
cost of an optimal object is given by the minimum value of $B(N, r_0, r_1, r_{N-1}, r_0)$ such that
$C(r_{N-1}, r_0, r_1) = \mathit{true}$. An optimal object can be found by tracing back as in typical
dynamic programming algorithms. The main problem with this approach is that the dynamic
programming table has $O(NR^4)$ entries and it takes $O(R)$ time to compute each entry. The
overall algorithm runs in $O(NR^5)$ time, which is quite slow.
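
The recursion can be written down directly. The sketch below is only an illustration of the dynamic program described above, not the code used in the experiments; the function name `optimal_convex_cost` and the callable signatures for `D` and `C` are assumptions, and it returns only the optimal cost (no trace-back).

```python
def optimal_convex_cost(N, R, D, C):
    """Minimum energy of a convex hypothesis, following the recursion for B.

    D(i, a, b): data cost of the segment from (theta_i, a) to (theta_{i+1}, b).
    C(a, b, c): True when radii a, b, c at consecutive angles are locally convex.
    Both are caller-supplied (hypothetical signatures). Runs in O(N R^5) time.
    """
    INF = float("inf")
    radii = range(R)
    # B maps (r0, r1, r_prev, r_cur) to the best partial cost; this is B(1, ...).
    B = {(r0, r1, r0, r1): D(0, r0, r1) for r0 in radii for r1 in radii}
    for i in range(1, N):
        extended = {}
        for (r0, r1, r_prev, r_cur), cost in B.items():
            # The last sample must close the boundary: r_N = r_0.
            choices = (r0,) if i == N - 1 else radii
            for r_next in choices:
                if not C(r_prev, r_cur, r_next):
                    continue
                key = (r0, r1, r_cur, r_next)
                value = cost + D(i, r_cur, r_next)
                if value < extended.get(key, INF):
                    extended[key] = value
        B = extended
    # Close the object: also require local convexity at r_0.
    return min(
        (cost for (r0, r1, r_prev, _), cost in B.items() if C(r_prev, r0, r1)),
        default=INF,
    )
```

Each of the N - 1 extension steps touches up to R^4 table entries and tries up to R choices for the next radius, which matches the $O(NR^5)$ bound.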

Now we show how optimal convex objects can be defined in terms of a lightest derivation
problem. Let $\mathit{convex}(i, r_0, r_1, r_{i-1}, r_i)$ denote a partial convex object starting at $r_0$ and $r_1$
and ending at $r_{i-1}$ and $r_i$. This corresponds to an entry in the dynamic programming table
described above. Define the set of statements,

$$\Sigma = \{\mathit{convex}(i, a, b, c, d) \mid i \in [1, N],\ a, b, c, d \in [0, R-1]\} \cup \{\mathit{goal}\}.$$

An optimal convex object corresponds to a lightest derivation of goal using the rules in
Figure 11. The first set of rules specifies the cost of a partial object from $r_0$ to $r_1$. The
second set of rules specifies that an object ending at $r_{i-1}$ and $r_i$ can be extended with a
choice for $r_{i+1}$ such that the boundary is locally convex at $r_i$. The last set of rules specifies
that a complete convex object is a partial object from $r_0$ to $r_N$ such that $r_N = r_0$ and the
boundary is locally convex at $r_0$.

(1) for $r_0, r_1 \in [0, R-1]$,

    ------------------------------------------------------------
    $\mathit{convex}(1, r_0, r_1, r_0, r_1) = D(0, r_0, r_1)$

(2) for $r_0, r_1, r_{i-1}, r_i, r_{i+1} \in [0, R-1]$ such that $C(r_{i-1}, r_i, r_{i+1}) = \mathit{true}$,

    $\mathit{convex}(i, r_0, r_1, r_{i-1}, r_i) = w$
    ------------------------------------------------------------
    $\mathit{convex}(i+1, r_0, r_1, r_i, r_{i+1}) = w + D(i, r_i, r_{i+1})$

(3) for $r_0, r_1, r_{N-1} \in [0, R-1]$ such that $C(r_{N-1}, r_0, r_1) = \mathit{true}$,

    $\mathit{convex}(N, r_0, r_1, r_{N-1}, r_0) = w$
    ------------------------------------------------------------
    $\mathit{goal} = w$

Figure 11: Rules for finding an optimal convex object.

To construct an abstraction hierarchy we define L nested partitions of the radius space
$[0, R-1]$ into ranges of integers. In an abstract statement instead of specifying an integer
value for $r(\theta)$ we will specify the range in which $r(\theta)$ is contained. To simplify notation we
assume that R is a power of two. The k-th partition $P^k$ contains $R/2^k$ ranges, each with
$2^k$ consecutive integers. The j-th range in $P^k$ is given by $[j \cdot 2^k, (j+1) \cdot 2^k - 1]$.
The statements in the abstraction hierarchy are, 

$$\Sigma_k = \{\mathit{convex}(i, a, b, c, d) \mid i \in [1, N],\ a, b, c, d \in P^k\} \cup \{\mathit{goal}_k\},$$

for $k \in [0, L-1]$. A range in $P^0$ contains a single integer so $\Sigma_0 = \Sigma$. Let f map a range
in $P^k$ to the range in $P^{k+1}$ containing it. For statements in level $k < L-1$ we define the
abstraction function,

$$\mathit{abs}(\mathit{convex}(i, a, b, c, d)) = \mathit{convex}(i, f(a), f(b), f(c), f(d)),$$
$$\mathit{abs}(\mathit{goal}_k) = \mathit{goal}_{k+1}.$$
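
If each range in $P^k$ is identified by its index j, the map f reduces to integer division by two. The following minimal sketch makes this concrete; the index-based encoding and the names `range_bounds`, `parent` and `abs_convex` are assumptions introduced here for illustration.

```python
def range_bounds(j, k):
    """Inclusive integer bounds of the j-th range in the partition P^k."""
    return (j * 2**k, (j + 1) * 2**k - 1)

def parent(j):
    """f: index of the range in P^(k+1) that contains the j-th range of P^k."""
    return j // 2

def abs_convex(statement):
    """Abstraction of convex(i, a, b, c, d), with ranges given by their indices."""
    i, a, b, c, d = statement
    return (i, parent(a), parent(b), parent(c), parent(d))
```

For instance, range 3 of $P^2$ covers the radii 12 to 15 and is contained in range 1 of $P^3$, which covers 8 to 15.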

The abstract rules use bounds on the data costs for boundary segments between $(\theta_i, s_i)$
and $(\theta_{i+1}, s_{i+1})$ where $s_i$ and $s_{i+1}$ are ranges in $P^k$,

$$D^k(i, s_i, s_{i+1}) = \min_{r_i \in s_i,\ r_{i+1} \in s_{i+1}} D(i, r_i, r_{i+1}).$$

Since each range in $P^k$ is the union of two ranges in $P^{k-1}$, one entry in $D^k$ can be computed
quickly (in constant time) once $D^{k-1}$ is computed. The bounds for all levels can be computed
in $O(NR^2)$ time total. We also need abstract versions of the convexity constraints.
For $s_{i-1}, s_i, s_{i+1} \in P^k$, let $C^k(s_{i-1}, s_i, s_{i+1}) = \mathit{true}$ if there exist integers $r_{i-1}$, $r_i$ and $r_{i+1}$
in $s_{i-1}$, $s_i$ and $s_{i+1}$ respectively such that $C(r_{i-1}, r_i, r_{i+1}) = \mathit{true}$. The value of $C^k$ can be
defined in closed form and evaluated quickly using simple geometry.
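
The bottom-up computation of the cost bounds can be sketched as a pyramid of minima. The code below is an illustrative sketch under assumptions of our own: the data costs are stored as nested lists `D0[i][a][b]` with R a power of two, and `coarsen_costs` and `cost_pyramid` are hypothetical helper names.

```python
def coarsen_costs(D_prev):
    """Compute D^k from D^(k-1): each entry is the minimum over a 2 x 2 block."""
    coarse = []
    for table in D_prev:  # one square table per angle index i
        half = len(table) // 2
        coarse.append([
            [min(table[2 * a][2 * b], table[2 * a][2 * b + 1],
                 table[2 * a + 1][2 * b], table[2 * a + 1][2 * b + 1])
             for b in range(half)]
            for a in range(half)
        ])
    return coarse

def cost_pyramid(D0, L):
    """Return [D^0, D^1, ..., D^(L-1)], each level built from the one below."""
    levels = [D0]
    for _ in range(1, L):
        levels.append(coarsen_costs(levels[-1]))
    return levels
```

Level k has $N (R/2^k)^2$ entries and each costs constant time, so the total work is a geometric sum bounded by a constant times $NR^2$.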

The rules in the abstraction hierarchy are almost identical to the rules in Figure 11.
The rules in level k are obtained from the original rules by simply replacing each instance
of $[0, R-1]$ by $P^k$, C by $C^k$ and D by $D^k$.

Method                                        Running time
--------------------------------------------------------------
Standard DP                                   6718.6 seconds
CFDP                                            13.5 seconds
HA*LD                                            8.6 seconds
A* with pattern database in $\Sigma_2$          14.3 seconds
A* with pattern database in $\Sigma_3$          29.7 seconds

Table 1: Running time comparison for the example in Figure 12.

9.1 Experimental Results

Figure 12 shows an example image with a set of reference locations that we selected manually 
and the optimal convex object found around each reference point. There are 14 reference 
locations and we used N = 30 and R = 60 to parametrize each object. Table 1 compares the
running time of different optimization algorithms we implemented for this problem. Each 
line shows the time it took to solve all 14 problems contained in the example image using 
a particular search algorithm. The standard DP algorithm uses the dynamic programming 
solution outlined above. The CFDP method is based on the algorithm by Raphael (2001) 
but modified for our representation of convex objects. Our hierarchical A* algorithm uses 
the abstraction hierarchy described here. For A* with pattern databases we used dynamic 
programming to compute a pattern database at a particular level of abstraction, and then 
used this database to provide heuristic values for A*. Note that for the problem described 
here the pattern database depends on the input. The running times listed include the time 
it took to compute the pattern database in each case. 

We see that CFDP, HA*LD and A* with pattern databases are much more efficient than 
the standard dynamic programming algorithm that does not use abstractions. HA*LD is 
slightly faster than the other methods in this example. Note that while the running time
varies from algorithm to algorithm the output of every method is the same, as they all find
globally optimal objects.

For a quantitative evaluation of the different search algorithms we created a large set of 
problems of varying difficulty and size as follows. For a given value of R we generated square 
images of width and height $2R + 1$. Each image has a circle with radius less than R near
the center and the pixels in an image are corrupted by independent Gaussian noise. The
difficulty of a problem is controlled by the standard deviation, $\sigma$, of the noise. Figure 13
shows some example images and the optimal convex object found around their centers.

The graph in Figure 14 shows the running time (in seconds) of the different search 
algorithms as a function of the noise level when the problem size is fixed at R = 100.
Each sample point indicates the average running time over 200 random inputs. The graph
shows running times up to a point after which the circles cannot be reliably detected. We
compared HA*LD with CFDP and A* using pattern databases (PD2 and PD3). Here PD2
and PD3 refer to A* with a pattern database defined in $\Sigma_2$ and $\Sigma_3$ respectively. Since the
pattern database needs to be recomputed for each input there is a trade-off between the amount
of time spent computing the database and the accuracy of the heuristic it provides. We
see that for easy problems it is better to use a smaller database (defined at a higher level
of abstraction) while for harder problems it is worth spending time computing a bigger
database. HA*LD outperforms the other methods in every situation captured here.

Figure 13: Random images with circles and the optimal convex object around the center of
each one (with N = 20 and R = 100). The noise level in the images is $\sigma = 50$.



Figure 15 shows the running time of the different methods as a function of the problem
size R, on problems with a fixed noise level of $\sigma = 100$. As before each sample point
indicates the average running time taken over 200 random inputs. We see that the running
time of the pattern database approach grows quickly as the problem size increases. This is
because computing the database at any fixed level of abstraction takes $O(NR^5)$ time. On
the other hand the running time of both CFDP and HA*LD grows much more slowly. While
CFDP performed essentially as well as HA*LD in this experiment, the graph in Figure 14
shows that HA*LD performs better as the difficulty of the problem increases.

10. Finding Salient Curves in Images 

A classical problem in computer vision involves finding salient curves in images. Intuitively 
the goal is to find long and smooth curves that go along paths with high image gradient. 
The standard way to pose the problem is to define a saliency score and search for curves 
optimizing that score. Most methods use a score defined by a simple combination of local 
terms. For example, the score usually depends on the curvature and the image gradient at 
each point of a curve. This type of score can often be optimized efficiently using dynamic 
programming or shortest paths algorithms (Montanari, 1971; Shashua & Ullman, 1988;
Basri & Alter, 1996; Williams & Jacobs, 1996).

Here we consider a new compositional model for finding salient curves. An important
aspect of this model is that it can capture global shape constraints. In particular, it looks
for curves that are almost straight, something that cannot be done using local constraints
alone. Local constraints can enforce small curvature at each point of a curve, but this is
not enough to prevent curves from turning and twisting around over long distances. The
problem of finding the most salient curve in an image with the compositional model defined
here can be solved using dynamic programming, but the approach is too slow for practical
use. Shortest paths algorithms are not applicable because of the compositional nature of
the model. Instead we can use A*LD with a heuristic function derived from an abstraction
(a pattern database).

Figure 14: Running time of different search algorithms as a function of the noise level $\sigma$ in
the input. Each sample point indicates the average running time taken over 200
random inputs. In each case N = 20 and R = 100. See text for discussion.

Let $C_1$ be a curve with endpoints a and b and $C_2$ be a curve with endpoints b and c.
The two curves can be composed to form a curve C from a to c. We define the weight of the
composition to be the sum of the weights of $C_1$ and $C_2$ plus a shape cost that depends on
the geometric arrangement of points (a, b, c). Figure 16 illustrates the idea and the shape
costs we use. Note that when $C_1$ and $C_2$ are long, the arrangement of their endpoints reflects
non-local geometric properties. In general we consider composing $C_1$ and $C_2$ if the angle
formed by ab and bc is at least $\pi/2$ and the lengths of $C_1$ and $C_2$ are approximately equal.
These constraints reduce the total number of compositions and play an important role in
the abstract problem defined below.
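
A shape cost of this kind can be sketched in a few lines. The code assumes the cost is proportional to $\sin^2(t)$, where t is the angle at b between the segments ba and bc, which is our reading of the Figure 16 caption; the function name `shape` is taken from Figure 17, while the `scale` parameter and the convention of returning `None` for disallowed compositions are assumptions made for this example.

```python
import math

def shape(a, b, c, scale=1.0):
    """Shape cost for composing a curve from a to b with a curve from b to c.

    t is the angle at b between segments ba and bc (points assumed distinct).
    Returns None when t < pi/2 (composition not considered), otherwise
    scale * sin(t)**2, where scale is a hypothetical proportionality constant.
    """
    ux, uy = a[0] - b[0], a[1] - b[1]
    vx, vy = c[0] - b[0], c[1] - b[1]
    dot = ux * vx + uy * vy
    cross = ux * vy - uy * vx
    if dot > 0:  # the angle at b is below pi/2
        return None
    t = math.atan2(abs(cross), dot)  # angle in [pi/2, pi]
    return scale * math.sin(t) ** 2
```

Because the cost depends only on an angle, scaling all three points by the same factor leaves it unchanged, and collinear points (t equal to $\pi$) incur no cost.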

Besides the compositional rule we say that if a and b are nearby locations, then there is
a short curve with endpoints a and b. This forms a base case for creating longer curves. We
assume that these short curves are straight, and their weight depends only on the image
data along the line segment from a to b. We use a data term, seg(a, b), that is zero if the
image gradient along pixels in ab is perpendicular to ab, and higher otherwise.

Figure 15: Running time of different search algorithms as a function of the problem size R.
Each sample point indicates the average running time taken over 200 random
inputs. In each case N = 20 and $\sigma = 100$. See text for discussion.

Figure 16: A curve with endpoints (a, c) is formed by composing curves with endpoints
(a, b) and (b, c). We assume that $t \ge \pi/2$. The cost of the composition is
proportional to $\sin^2(t)$. This cost is scale invariant and encourages curves to be
relatively straight.

(1) for pixels a, b, c where the angle between ab and bc is at least $\pi/2$ and for $0 \le i \le L$,

    $\mathit{curve}(a, b, i) = w_1$
    $\mathit{curve}(b, c, i) = w_2$
    ------------------------------------------------------------
    $\mathit{curve}(a, c, i+1) = w_1 + w_2 + \mathit{shape}(a, b, c)$

(2) for pixels a, b with $k_1 \le \|a - b\| \le k_2$,

    ------------------------------------------------------------
    $\mathit{curve}(a, b, 0) = \mathit{seg}(a, b)$

Figure 17: Rules for finding "almost straight" curves between a pair of endpoints. Here L,
$k_1$ and $k_2$ are constants, while shape(a, b, c) is a function measuring the cost of
a composition.

Figure 17 gives a formal definition of the two rules in our model. The constants $k_1$ and
$k_2$ specify the minimum and maximum length of the base case curves, while L is a constant
controlling the maximum depth of derivations. A derivation of curve(a, b, i) encodes a curve
from a to b. The value i can be seen as an approximate measure of arclength. A derivation
of curve(a, b, i) is a full binary tree of depth i that encodes a curve with length between
$2^i k_1$ and $2^i k_2$. We let $k_2 = 2k_1$ to allow for curves of any length.
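
To see why this choice covers every length, note that a derivation of depth i concatenates $2^i$ base curves, each of length between $k_1$ and $k_2$. As a short worked step (treating the length of a curve as the sum of its base segment lengths, which is our reading of the text, and ignoring the depth bound L):

```latex
2^i k_1 \;\le\; \text{length} \;\le\; 2^i k_2 = 2^{i+1} k_1
\quad\Longrightarrow\quad
\bigcup_{i \ge 0} \bigl[\, 2^i k_1,\; 2^{i+1} k_1 \,\bigr] = [\, k_1, \infty).
```

Consecutive intervals share an endpoint, so no length above $k_1$ is left out, whereas a choice of $k_2 < 2k_1$ would leave gaps between the intervals. In practice the constant L caps the longest curve that can be derived.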

The rules in Figure 17 do not define a good measure of saliency by themselves because
they always prefer short curves over long ones. We can define the saliency of a curve in
terms of its weight minus its arclength, so that salient curves will be light and long. Let $\lambda$
be a positive constant. We consider finding the lightest derivation of goal using,

    $\mathit{curve}(a, b, i) = w$
    ------------------------------------------------------------
    $\mathit{goal} = w - \lambda 2^i$

For an n × n image there are $\Omega(n^4)$ statements of the form curve(a, c, i). Moreover, if a
and c are far apart there are $\Omega(n)$ choices for a "midpoint" b defining the two curves that
are composed in a lightest derivation of curve(a, c, i). This makes a dynamic programming
solution to the lightest derivation problem impractical. We have tried using KLD but even
for small images the algorithm runs out of memory after a few minutes. Below we describe
an abstraction we have used to define a heuristic function for A*LD.

Consider a hierarchical set of partitions of an image into boxes. The i-th partition is
defined by tiling the image into boxes of $2^i \times 2^i$ pixels. The partitions form a pyramid with
boxes of different sizes at each level. Each box at level i is the union of 4 boxes at the level
below it, and the boxes at level 0 are the pixels themselves. Let $f_i(a)$ be the box containing
a in the i-th level of the pyramid. Now define

$$\mathit{abs}(\mathit{curve}(a, b, i)) = \mathit{curve}(f_i(a), f_i(b), i).$$

Figure 18: The abstraction maps each curve statement to a statement about curves between
boxes. If i > j then curve(a, b, i) gets coarsened more than curve(c, d, j). Since
light curves are almost straight, i > j usually implies that $\|a - b\| > \|c - d\|$.



Figure 18 illustrates how this map selects a pyramid level for an abstract statement.
Intuitively abs defines an adaptive coarsening criterion. If a and b are far from each other, a
curve from a to b must be long, which in turn implies that we map a and b to boxes in a
coarse partition of the image. This creates an abstract problem that has a small number of
statements without losing too much information.
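
With pixels stored as non-negative integer coordinates, the box map and the abstraction can be sketched with bit shifts. The encoding of boxes by their grid indices and the names `box` and `abs_curve` are assumptions made for this illustration.

```python
def box(pixel, i):
    """f_i: grid index of the 2^i x 2^i box containing a pixel (x, y).

    Assumes non-negative integer pixel coordinates.
    """
    x, y = pixel
    return (x >> i, y >> i)

def abs_curve(statement):
    """Abstraction of curve(a, b, i): both endpoints are mapped to level-i boxes."""
    a, b, i = statement
    return (box(a, i), box(b, i), i)
```

At level 0 the map is the identity, so statements about short curves keep pixel precision, while statements about long curves are coarsened to larger boxes.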

To define the abstract problem we also need to define a set of abstract rules. Recall
that for every concrete rule r we need a corresponding abstract rule r' where the weight
of r' is at most the weight of r. There are a small number of rules with no antecedents in
Figure 17. For each concrete rule $\to_{\mathit{seg}(a,b)} \mathit{curve}(a, b, 0)$ we define a corresponding abstract
rule, $\to_{\mathit{seg}(a,b)} \mathit{abs}(\mathit{curve}(a, b, 0))$. The compositional rules from Figure 17 lead to abstract
rules for composing curves between boxes,

$$\mathit{curve}(A, B, i),\ \mathit{curve}(B, C, i) \to_v \mathit{curve}(A', C', i+1),$$

where A, B and C are boxes at the i-th pyramid level while A' and C' are the boxes at
level i + 1 containing A and C respectively. The weight v should be at most shape(a, b, c)
where a, b and c are arbitrary pixels in A, B and C respectively. We compute a value for
v by bounding the orientations of the line segments ab and bc between boxes.

Figure 19: The most salient curve in different images. The running time is the sum of
the time spent computing the pattern database and the time spent solving the
concrete problem. Image sizes and running times: 122 × 179 pixels, 65 seconds
(43 + 22); 226 × 150 pixels, 73 seconds (61 + 12).

The abstract problem defined above is relatively small even in large images, so we can
use the pattern database approach outlined in Section 5.1. For each input image we use 
KLD to compute lightest context weights for every abstract statement. We then use these 
weights as heuristic values for solving the concrete problem with A*LD. Figure 19 illustrates 
some of the results we obtained using this method. The abstract problem seems able to
capture the fact that most short curves cannot be extended to a salient curve. It took
about one minute to find the most salient curve in each of these images. Figure 19 lists the 
dimensions of each image and the running time in each case. 

Note that our algorithm does not rely on an initial binary edge detection stage. Instead 
the base case rules allow for salient curves to go over any pixel, even if there is no local 
evidence for a boundary at a particular location. Figure 20 shows an example where this 
happens. In this case there is a small part of the horse's back that blends with the background
if we consider local properties alone. 

The curve finding algorithm described in this section would be very difficult to formulate 
without A*LD and the general notion of heuristics derived from abstractions for lightest 
derivation problems. However, using the framework introduced in this paper it becomes 
relatively easy to specify the algorithm. 

In the future we plan to "compose" the rules for computing salient curves with rules for 
computing more complex structures. The basic idea of using a pyramid of boxes for defining 
an abstract problem should be applicable to a variety of problems in computer vision.

11. Conclusion

Although we have presented some preliminary results in the last two sections, we view the
main contribution of this paper as providing a general architecture for perceptual inference.
Dijkstra's shortest paths algorithm and A* search are both fundamental algorithms with
many applications. Knuth noted the generalization of Dijkstra's algorithm to more general
problems defined by a set of recursive rules. In this paper we have given similar
generalizations for A* search and heuristics derived from abstractions. We have also described
a new method for solving lightest derivation problems using a hierarchy of abstractions.
Finally, we have outlined an approach for using these generalizations in the construction of
processing pipelines for perceptual inference.